
apache-spark - HIVE: insert query fails with the error "java.lang.OutOfMemoryError: GC overhead limit exceeded"


My Hive insert query is failing with the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded

Data in table2 = 1.7 TB
Query:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapreduce.map.memory.mb=15000;
set mapreduce.map.java.opts=-Xmx9000m;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;
set hive.rpc.query.plan=true;
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;

Error info:
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. GC overhead limit exceeded

Cluster info: total memory: 1.2 TB, total vcores: 288, total nodes: 8, node version: 2.7.0-mapr-1808



Please note:
I am trying to insert data from table2, stored in Parquet format, into table1, stored in ORC format (a sketch of the table definitions follows below).
The total data size is 1.8 TB.
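For context, a minimal sketch of what the source and target table definitions might look like; the column names and types are assumptions, since the original post does not show the DDL:

-- Hypothetical source table (Parquet); the dynamic partition column must be the last column in the SELECT
CREATE TABLE database1.table2 (
  id BIGINT,
  amount DOUBLE,
  trans_date STRING
)
STORED AS PARQUET;

-- Hypothetical target table (ORC), partitioned by trans_date
CREATE TABLE database1.table1 (
  id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (trans_date STRING)
STORED AS ORC;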

Best Answer

Adding distribute by on the partition key solves the problem:

insert into database1.table1 PARTITION(trans_date) select * from database1.table2
distribute by trans_date;
distribute by trans_date triggers a reducer step, and each reducer processes a single partition, which reduces memory pressure. When each process writes to many partitions at once, it keeps too many ORC buffers in memory.

Also consider adding this setting to control how much data each reducer will process:
set hive.exec.reducers.bytes.per.reducer=67108864; --this is example only, reduce the figure to increase parallelism
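
Putting the answer together, a consolidated sketch of the full fix might look like the following; the 64 MB bytes-per-reducer figure is only an example to be tuned for your cluster, and the memory settings from the question can be kept as they are:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=67108864;  -- ~64 MB per reducer; lower it to increase parallelism

insert into database1.table1 PARTITION(trans_date)
select * from database1.table2
distribute by trans_date;  -- forces a reduce stage; each reducer writes a single partition's ORC files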

Regarding apache-spark - HIVE: insert query fails with the error "java.lang.OutOfMemoryError: GC overhead limit exceeded", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59765693/
