
hadoop - Out of memory due to hash maps used in map-side aggregation

Reposted. Author: 可可西里. Updated: 2023-11-01 15:17:39

My Hive query throws this exception.

Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
2013-05-22 12:08:32,634 Stage-1 map = 0%, reduce = 0%
2013-05-22 12:09:19,984 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201305221200_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201305221200_0001_m_000007 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000003 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000001 (and more) from job job_201305221200_0001

Task with the most failures(4):
-----
Task ID:
task_201305221200_0001_m_000001

URL:
http://ip-10-134-7-119.ap-southeast-1.compute.internal:9100/taskdetails.jsp?jobid=job_201305221200_0001&tipid=task_201305221200_0001_m_000001

Possible error:
Out of memory due to hash maps used in map-side aggregation.

Solution:
Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'
-----

Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask


select
uri,
count(*) as hits
from
iislog
where
substr(cs_cookie,instr(cs_Cookie,'cwc'),30) like '%CWC%'
and uri like '%.aspx%'
and logdate = '2013-02-07'
group by uri
order by hits Desc;

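I tried this on 8 EMR core instances plus 1 large master instance, with 8 GB of data. First I used an external table (the data location is an S3 path). After that I downloaded the data from S3 to EMR and used a native Hive table. I got the same error in both cases.

FYI, I am using the regex SerDe to parse the IIS logs.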

'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" ="([0-9-]+) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9-]+ [0-9:.]+) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([0-9-]+ [0-9:.]+)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s %19$s %20$s %21$s %22$s %23$s %24$s %25$s %26$s %27$s %28$s %29$s %30$s %31$s %32$s")
location 's3://*******';

Best answer

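  • The location of the table doesn't matter to Hive.
  • It would be better if you could paste your query, so that one can figure out whether the mapper is also sorting.

    In any case, we need to increase the amount of memory. Check how much memory the map tasks are configured to run with (mapred.child ...). It should be at least about 1G. If that is large enough, you can:

    • If the mapper is not sorting: consider raising the hash aggregation memory percentage indicated in the log to a higher number.
    • If the mapper is sorting: just increase the task memory to a bigger number.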
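
The steps above could be sketched in a Hive session roughly as follows. This is only an illustration, not the answerer's exact commands: `mapred.child.java.opts` is assumed to be the "mapred.child ..." setting the answer has in mind (it is the Hadoop 1-era property for the task JVM options), and the values `-Xmx1024m` and `0.75` are placeholder numbers chosen for the example, not tuned recommendations.

```sql
-- Print the current values (in the Hive CLI, "set <name>;" shows a setting).
set mapred.child.java.opts;
set hive.map.aggr.hash.percentmemory;

-- Inspect the plan to see which operators run on the map side
-- (for example, whether a hash-mode Group By Operator appears there).
explain
select uri, count(*) as hits
from iislog
where uri like '%.aspx%'
  and logdate = '2013-02-07'
group by uri
order by hits desc;

-- Assumption: give each task JVM about 1 GB of heap, per "at least about 1G".
set mapred.child.java.opts=-Xmx1024m;

-- If the mapper is not sorting: raise the share of heap the map-side
-- aggregation hash table may use (the log shows it is currently 0.5).
set hive.map.aggr.hash.percentmemory=0.75;
```

Note that the error message in the log suggests the opposite direction for `hive.map.aggr.hash.percentmemory` (lowering it to 0.25), so if raising the percentage does not help once the heap is larger, trying the lower value from the log is a reasonable next experiment.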

Regarding "hadoop - Out of memory due to hash maps used in map-side aggregation", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16684712/
