gpt4 book ai didi

hadoop - Xml与Hive解析

转载 作者:行者123 更新时间:2023-12-02 21:35:08 26 4
gpt4 key购买 nike

我正在使用 hive serDe(https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources)进行XML解析并将其加载到 hive 中。
示例XML内容:

<records>
<record customer_id="0000-JTALA">
<income>200000</income>
<address type="M">
<Flatno>345</FlatNo>
<Street>ABS</street>
<city>QWW</city>
<country>US</country>
<pin>3235</pin>
</address>
<address type="B">
<Street>ABS</street>
<city>QWW</city>
<country>US</country>
<pin>3235</pin>
</address>
</record>

<record customer_id="0001-JTALA">
<income>200000</income>
<address type="M">
<Flatno>45</FlatNo>
<Street>fgBS</street>
<city>QWW</city>
<country>US</country>
<pin>3235</pin>
</address>
<address type="B">
<Street>ABS</street>
<city>QWW</city>
<country>US</country>
<pin>325</pin>
</address>
<address type="P">
<Street>ABS</street>
<city>QWW</city>
<country>UK</country>
<pin>325</pin>
</address>
</record>
</records>

对于每个地址,应创建一行。根据上面的示例,应该为第一位客户创建2条记录,为第二位客户创建3条记录,总共5条记录,按照我当前的代码,仅为单个客户创建了两条记录,并且在地址列中将所有地址连接在一起,因此对于第一个客户街道列,具有(第一个地址街道+第二个地址街道)。
查询样例:
CREATE external TABLE msg_details(customer_id STRING, income BIGINT, AType String,Flatno String, Street string,city string,country string,pin string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.address_type"="/record/address/@type",
"column.xpath.Flatno"="/record/address/Flatno/text()",
"column.xpath.Street"="/record/address/Street/text()",
"column.xpath.city"="/record/address/city/text()",
"column.xpath.country"="/record/address/country/text()"
"column.xpath.pin"="/record/address/pin/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/root/serdeinput'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);

最佳答案

一种方法是编写用于XML解析的用户定义(自定义)服务。
[ 要么 ]
编写UDF,以将包含在同一列中的Array值拆分为行。

您使用的Serde是通用的Serde,它几乎等同于Hive Serde提供的xpath,它们都具有仅提取记录的有限功能。

我使用横向 View 和其他方法尝试了3种其他方法,但不适用于地址类型中的所有列。

唯一的解决方案是继续使用自定义Serde来根据您的要求进行解析。

create external table msg_details3(customer_id string, income bigint, address_type Array<string>,Flatno Array<string>, Street ARRAY<string>,city ARRAY<string>,country ARRAY<string>,pin ARRAY<string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.address_type"="/record/address/@type",
"column.xpath.Flatno"="/record/address/Flatno/text()",
"column.xpath.Street"="/record/address/Street/text()",
"column.xpath.city"="/record/address/city/text()",
"column.xpath.country"="/record/address/country/text()",
"column.xpath.pin"="/record/address/pin/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/cloudera/data'
TBLPROPERTIES (
"xmlinput.start"="<record ",
"xmlinput.end"="</record>"
);

关于hadoop - Xml与Hive解析,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/32846235/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com