gpt4 book ai didi

xml - 在Hive XML SerDe中使用 “Attribute to Attribute”映射

转载 作者:行者123 更新时间:2023-12-02 20:57:25 33 4
gpt4 key购买 nike

我有一个看起来像下面的XML文档:

<root>
<unwanted>
...
</unwanted>
<wanted version="A">
<unwanted2 type='1'>
...
</unwanted2>
<unwanted2 type='2'>
...
</unwanted2>
<unwanted2 type='3'>
...
</unwanted2>
<wanted2>
<detail>
<row date="Jan-17" price="100" inventory="50">
<row date="Feb-17" price="101" inventory="40">
<row date="Mar-17" price="102" inventory="30">
</detail>
</wanted2>
</wanted>
<wanted version="B">
<unwanted2 type='1'>
...
</unwanted2>
<unwanted2 type='2'>
...
</unwanted2>
<unwanted2 type='3'>
...
</unwanted2>
<wanted2>
<detail>
<row date="Jan-17" price="200" inventory="60">
<row date="Feb-17" price="201" inventory="70">
<row date="Mar-17" price="202" inventory="80">
</detail>
</wanted2>
</wanted>
</root>

我想将文件导入 Hive表,最好是这种格式:
Version | Date   | Price | Inventory
A Jan-17 100 50
A Feb-17 101 40
A Mar-17 102 30
B Jan-17 200 60
B Feb-17 201 70
B Mar-17 202 80

但是我现在同意将其作为日期和价格的 map 导入:
version | spot_date
A {Date: Jan-17, Price: 100, Inventory: 50}
A {Date: Feb-17, ...}
A {Date: Mar-17, ...}
B {Date: Jan-17, ...}
B {Date: Feb-17, ...}
B {Date: Mar-17, ...}

我正在尝试使用 XMLSerDe for Hive,并使用“ attribute to attribute”功能。

我的查询如下所示:
CREATE EXTERNAL TABLE ppa_test(
version STRING,
spot_date MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.version"="/wanted/@version",
"column.xpath.spot_date"="/wanted/wanted2/detail/row",
"xml.map.specification.row"="date->@date"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<wanted ",
"xmlinput.end"="</wanted>"
);

但是,当我加载数据时,我得到:
version | spot_date
A {"row":"Mar-17"}
B {"row":"Mar-17"}

如果我改为将 xml.map.spec路径更改为:
"xml.map.specification.row"="@date->@price"

我可以分别读取XML的每一行,但是它记录在同一Hive表行中,而且我更喜欢使用属性名称:
Version | spot_date
A {"Mar-17":"102", "Feb-17":"101", "Jan-17":"100"}
B {"Mar-17":"202", "Feb-17":"201", "Jan-17":"200"}
  • 如何将每个XML row节点记录到其自己的Hive记录
  • 如何使用属性名称(或自定义字符串)作为键?


  • 编辑

    因此,从 spot_date MAP<STRING,STRING>更改为...
    CREATE EXTERNAL TABLE ppa_test(
    scenario STRING,
    spot_date array<struct<
    date: string,
    price: string,
    inventory: string,
    >>
    )...

    给我一系列对象
    Version | spot_date
    A [{date: Jan-17, price: 100, inventory: 50},
    {date: Feb-17, price: 101, inventory: 40},
    {date: Mar-17, price: 102, inventory: 30}]
    B [{date: Jan-17, ... ]

    从上面完成#2,但仍然不确定#1

    最佳答案

    您可以分解为#2创建的结构数组以得到#1。

    CREATE EXTERNAL TABLE ppa_test(
    scenario STRING,
    spot_date ARRAY<STRUCT<spotdates: struct<
    date: string,
    price: string,
    inventory: string,
    >>>
    )

    您可以为此使用横向 View
    DROP TABLE IF EXISTS ppa_test_exploded;
    CREATE TABLE ppa_test_exploded as
    SELECT scenario,
    SD.spotdates.date as date,
    SD.spotdates.price as price,
    SD.spotdates.inventory as inventory
    FROM ppa_test
    LATERAL VIEW EXPLODE(spot_date) exploded as SD;

    希望这可以帮助。

    关于xml - 在Hive XML SerDe中使用 “Attribute to Attribute”映射,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44167560/

    33 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com