
hadoop - ArrayList cannot be cast to org.apache.hadoop.io.Text thrown by HQL


I have a query that fails during the reduce stage. The error thrown is:

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)


However, digging into the YARN logs, I was able to find this:

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}}
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
    ... 7 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text


The most relevant part being:

java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text


This is the query I'm running (FYI: it is a subquery within a larger query):
SELECT
    yyyy_mm_dd,
    h_id,
    MAX(CASE WHEN rn = 1 THEN prov_id ELSE NULL END) OVER (PARTITION BY yyyy_mm_dd, h_id) AS primary_prov,
    collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) AS prov_id_api, -- re-assemble array to include all elements from multiple initial arrays if there are different arrays per prov_id
    prov_id
FROM (
    SELECT -- get "primary prov" (first element in ascending array)
        yyyy_mm_dd,
        h_id,
        prov_id,
        api,
        ROW_NUMBER() OVER (PARTITION BY yyyy_mm_dd, h_id ORDER BY api) AS rn
    FROM (
        SELECT -- explode array to get data at row level
            t.yyyy_mm_dd,
            t.h_id,
            prov_id,
            collect_set( -- array of integers, use set to remove duplicates
                CASE
                    WHEN e.apis_xml_element = 'res' THEN 1
                    WHEN e.apis_xml_element = 'av' THEN 2
                    ...
                    ...
                    ELSE e.apis_xml_element
                END) AS api
        FROM
            mytable t
        LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
        WHERE
            yyyy_mm_dd = "2020-05-05"
            AND t.apis_xml IS NOT NULL
        GROUP BY
            1, 2, 3
    ) s
) s
I have further narrowed the problem down to the top-level select, since the inner select works fine on its own, which leads me to believe the problem occurs specifically here:

collect_set(api) OVER (partition by yyyy_mm_dd, h_id, prov_id) prov_id_api


However, I'm not sure how to resolve it. In the innermost select, apis_xml is an array&lt;string&gt; which holds strings such as 'res' and 'av' up to a point; integers are used thereafter, so the CASE statement makes these consistent.
Oddly enough, if I run this via Spark, i.e. spark.sql(above_query), it works. However, when run through beeline with HQL, the job gets killed.
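
To isolate the failure mode, here is a minimal sketch; the table demo and its columns are hypothetical, and the exact failure point may vary by Hive version. The shape matches the query above: collect_set is handed a column that is already an array&lt;string&gt;, and on the Hive-on-MapReduce windowing path the incoming values appear to be handled as Text, hence the ArrayList-to-Text cast error:

-- Hypothetical repro: arr plays the role of api above,
-- i.e. it is already ARRAY<STRING> when collect_set sees it.
-- Assumed table: demo(grp INT, arr ARRAY<STRING>).
SELECT
    grp,
    collect_set(arr) OVER (PARTITION BY grp) AS arrs -- argument is an array, not a scalar
FROM demo;

Spark SQL evidently tolerates the array argument here, per the observation above, while Hive's MR windowing path does not, which would account for the same statement succeeding under spark.sql and dying under beeline.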

Best Answer

Remove the collect_set in the inner query, since it already produces an array; the collect_set one level up should receive scalars. Also remove the GROUP BY in the inner query, because without collect_set there is no aggregation any more. If you need to remove duplicates, use DISTINCT instead.
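
For concreteness, a sketch of the query with that fix applied, keeping the original table and column names; the remaining CASE branches are elided exactly as in the question, so treat this as an illustration rather than a drop-in replacement:

SELECT
    yyyy_mm_dd,
    h_id,
    MAX(CASE WHEN rn = 1 THEN prov_id ELSE NULL END) OVER (PARTITION BY yyyy_mm_dd, h_id) AS primary_prov,
    collect_set(api) OVER (PARTITION BY yyyy_mm_dd, h_id, prov_id) AS prov_id_api, -- api is now a scalar per row
    prov_id
FROM (
    SELECT
        yyyy_mm_dd,
        h_id,
        prov_id,
        api,
        ROW_NUMBER() OVER (PARTITION BY yyyy_mm_dd, h_id ORDER BY api) AS rn
    FROM (
        SELECT DISTINCT -- DISTINCT replaces the GROUP BY + collect_set de-duplication
            t.yyyy_mm_dd,
            t.h_id,
            prov_id,
            CASE
                WHEN e.apis_xml_element = 'res' THEN 1
                WHEN e.apis_xml_element = 'av' THEN 2
                -- ... remaining branches elided, as in the question
                ELSE e.apis_xml_element
            END AS api -- a scalar, not an array
        FROM
            mytable t
        LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
        WHERE
            yyyy_mm_dd = "2020-05-05"
            AND t.apis_xml IS NOT NULL
    ) s
) s

With the inner collect_set gone, each row carries a single api value, and the outer windowed collect_set receives the scalars it expects.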

Regarding hadoop - ArrayList cannot be cast to org.apache.hadoop.io.Text thrown by HQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62514737/
