Java heap size memory on the mapping step in Hive SQL


I run the following HQL:

select new.uid as uid, new.category_id as category_id, new.atag as atag,
       new.rank_idx + CASE when old.rank_idx is not NULL then old.rank_idx else 0 END as rank_idx
from (
    select a1.uid, a1.category_id, a1.atag,
           row_number() over(distribute by a1.uid, a1.category_id sort by a1.cmt_time) as rank_idx
    from (
        select app.uid,
               CONCAT(cast(app.knowledge_point_id_list[0] as string),'#',cast(app.type_id as string)) as category_id,
               app.atag as atag, app.cmt_time as cmt_time
        from model.mdl_psr_app_behavior_question_result app
        where app.subject = 'english'
          and app.dt = '2016-01-14'
          and app.cmt_timelen > 1000
          and app.cmt_timelen < 120000
    ) a1
) new
left join (
    select uid, category_id, rank_idx
    from model.mdl_psr_mlc_app_count_last
    where subject = 'english'
      and dt = '2016-01-13'
) old
on new.uid = old.uid
and new.category_id = old.category_id

Originally, mdl_psr_mlc_app_count_last and mdl_psr_mlc_app_count_day were stored with JsonSerde, and the query ran fine.

My colleague thinks JsonSerde is very inefficient and takes up far too much space, and that Parquet is the better choice for me.
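
For context, a minimal sketch of that kind of format switch, assuming the lookup table only needs the columns used in the query; the exact DDL below is my assumption and is not taken from the question:

-- Hypothetical DDL for switching the lookup table from a JSON SerDe to Parquet.
-- Table, partition and column names follow the query above; the types are assumed.
CREATE TABLE model.mdl_psr_mlc_app_count_last_parquet (
    uid          string,
    category_id  string,
    rank_idx     int
)
PARTITIONED BY (subject string, dt string)
STORED AS PARQUET;

INSERT OVERWRITE TABLE model.mdl_psr_mlc_app_count_last_parquet
PARTITION (subject = 'english', dt = '2016-01-13')
SELECT uid, category_id, rank_idx
FROM model.mdl_psr_mlc_app_count_last
WHERE subject = 'english' AND dt = '2016-01-13';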

When I switched the format, the query broke with the following error log:

org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1 rows: used memory = 1024506232
2016-01-19 16:36:56,119 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10 rows: used memory = 1024506232
2016-01-19 16:36:56,130 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100 rows: used memory = 1024506232
2016-01-19 16:36:56,248 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1000 rows: used memory = 1035075896
2016-01-19 16:36:56,694 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10000 rows: used memory = 1045645560
2016-01-19 16:36:57,056 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100000 rows: used memory = 1065353232

It looks like a Java memory problem. Someone suggested I try:

SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=8048;
SET mapreduce.reduce.java.opts='-Xmx8048M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx4096M';
set mapred.child.map.java.opts='-Xmx4096M';
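
(A side note that is my own observation rather than part of the original question: the block above asks for a 4 GB map heap, -Xmx4096M, inside a 1 GB map container, mapreduce.map.memory.mb=1024. A commonly cited rule of thumb is to keep the JVM heap at roughly 75-80% of the container size, for example:)

-- Assumed, consistent sizing, not taken from the question: heap at ~80% of the container.
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx3276m;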

It still broke with the same error message. Then someone suggested:

SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=1024;
SET mapreduce.reduce.java.opts='-Xmx1024M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx1024M';
set mapreduce.child.map.java.opts='-Xmx1024M';
set mapred.reduce.tasks = 40;

Now it runs fine.

Can someone explain why?

==================================
BTW: although it now runs, the reduce step is very slow. While you are at it, can you explain why?

Best answer

For some reason, YARN deals poorly with the Parquet format.

Quoting MapR:

For example, if the MapReduce job sorts parquet files, Mapper needs to cache the whole Parquet row group in memory. I have done tests to prove that the larger the row group size of parquet files is, the larger Mapper memory is needed. In this case, make sure the Mapper memory is large enough without triggering OOM.

I am not quite sure why the different settings in the question matter, but the simple solution is to drop Parquet and use ORC: a small performance loss in exchange for no errors.
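
A minimal sketch of that workaround, reusing the table and column names from the question (the CTAS itself is my assumption, not something the answer shows):

-- Assumed sketch: rebuild the lookup table as ORC instead of Parquet.
-- CTAS cannot create a partitioned table, so subject/dt become ordinary columns here.
CREATE TABLE model.mdl_psr_mlc_app_count_last_orc
STORED AS ORC
AS
SELECT uid, category_id, rank_idx, subject, dt
FROM model.mdl_psr_mlc_app_count_last;

If staying on Parquet were a requirement, another option suggested by the MapR quote (again my assumption, not something the answer tested) would be to write the files with a smaller row group, e.g. SET parquet.block.size=67108864; before the INSERT that produces them, so each mapper has less to cache per row group.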

Regarding Java heap size memory on the mapping step in Hive SQL, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/34873037/
