
hadoop - COLLECT_SET() in Hive (Hadoop)


I just learned about the collect_set() function in Hive, and started a job on a 3-node development cluster.

I only have about 10 GB to process, yet the job is taking forever. I suspect there is either a bug in the implementation of collect_set(), a bug in my code, or that collect_set() really is resource-intensive.
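For context, COLLECT_SET() is a Hive aggregate function that gathers the distinct values of a column into an array, one array per GROUP BY group. A minimal sketch of the behavior (hypothetical data, using the site_event table from the query below):

-- COLLECT_SET() deduplicates values within each group.
-- If session 's1' has events (274, 274, 55), the result row is:
--   s1    [274, 55]
SELECT session_key,
       COLLECT_SET(event_id) AS event_set
FROM site_event
GROUP BY session_key;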

Here is my Hive SQL (no pun intended):

INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
FROM site_session sess
JOIN (SELECT * FROM site_event
      WHERE event_id = 274 OR event_id = 284 OR event_id = 55 OR event_id = 151) evt
  ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = evt.session_key)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;

It runs as 4 MR passes. The first took about 30 seconds. The second map took about 1 minute, and most of the second reduce finished in about 2 minutes. Over the last two hours, though, it has only crept from 97.71% to 97.73%. Is this right? I figure something must be wrong. I took a look at the log, but I can't tell whether it's normal.

[Log sample]

2011-06-21 16:32:22,715 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:22,758 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5142000000 rows
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5142000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5143000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5143000000 rows
2011-06-21 16:32:24,725 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:24,768 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 6 forwarding 42000000 rows
2011-06-21 16:32:24,771 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5144000000 rows
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5144000000 rows
2011-06-21 16:32:26,467 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5145000000 rows
2011-06-21 16:32:26,468 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5145000000 rows
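Billions of rows being forwarded out of the join operators for ~10 GB of input suggests one of the joins is exploding. One way to sanity-check the join inputs before running the full query is to count the rows feeding each side (a sketch, using the table names from the query above; older Hive releases require UNION ALL to sit inside a subquery):

-- Row counts feeding each join input.
SELECT t.src, t.cnt
FROM (SELECT 'site_session' AS src, COUNT(*) AS cnt FROM site_session
      UNION ALL
      SELECT 'site_event' AS src, COUNT(*) AS cnt
      FROM site_event
      WHERE event_id IN (274, 284, 55, 151)
      UNION ALL
      SELECT 'site_hit' AS src, COUNT(*) AS cnt FROM site_hit) t;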

I'm still fairly new at this, and trying to work with collect_set() and Hive arrays has me stuck.

Thanks in advance :)

Best Answer

Epic fail. My solution is below. There was nothing wrong with COLLECT_SET after all; it was simply trying to collect all of the items, of which there were infinitely many.

Why? Because I was joining on something that wasn't even part of the set. The second join used to repeat the same ON condition; now it properly references hit: ON (sess.session_key = hit.session_key).

INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
FROM tealeaf_session sess
JOIN site_event evt ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = hit.session_key)
WHERE evt.event_id IN (274, 284, 55, 151)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
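With the corrected join, each session's event_set can contain at most the four event IDs in the filter, so the arrays stay small and COLLECT_SET stays cheap. A quick way to verify this on the output (a sketch, assuming the sequence_result_1 table written above):

-- size() returns the number of elements in a Hive array; with the fixed
-- join, no event_set should exceed the 4 distinct IDs in the filter.
SELECT MAX(size(event_set)) AS max_set_size
FROM sequence_result_1;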

On hadoop - COLLECT_SET() in Hive (Hadoop), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6433338/
