
hadoop - Why does increasing the number of reducers increase the time spent in the reduce phase?


I ran my Hadoop program on AWS today with different numbers of reducers, and I observed that as the number of reducers increased, the running time did not decrease; it increased. By time, I mean the interval from "Map 100%, Reduce 30%" to "Map 100%, Reduce 100%".

Best Answer

Keep in mind that data has to be sent over the network to the reducers. If the output of your mappers is not very large, increasing the number of reducers can hurt performance: the results have to be transferred to more reducers, and because each reducer creates its own output file, more files have to be created and the number of I/O operations grows.

Each reducer also has to be started and instantiated on a node, which increases startup time. In addition, the data has to be partitioned across all the reducers, which adds network transfer and parsing time.

Finally, if you don't actually need reducers, it is best to set their number to zero: Hadoop then skips creating them entirely and the whole job runs faster (see the sketch below).
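As a minimal sketch of where the reducer count is chosen: Job.setNumReduceTasks is the standard MapReduce API call, while the mapper/reducer class names and paths here are placeholders, not from the original question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "reducer count demo");
        job.setJarByClass(ReducerCountExample.class);
        // MyMapper / MyReducer stand in for your own classes.
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Explicitly choose the number of reduce tasks. Each reducer
        // writes its own output file (part-r-00000, part-r-00001, ...),
        // so more reducers means more files and more I/O.
        job.setNumReduceTasks(4);

        // For a map-only job, set the count to zero: the shuffle and
        // sort phases are skipped entirely and mappers write their
        // output directly (part-m-*).
        // job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}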

Quoting from the Yahoo developer documentation:

The efficiency of reduces is driven by a large extent by the performance of the shuffle.

The number of reduces configured for the application (r) is, obviously, a crucial factor.

Having too many or too few reduces is anti-productive:

Too few reduces cause undue load on the node on which the reduce is scheduled — in extreme cases, we have seen reduces processing over 100GB per-reduce. This also leads to very bad failure-recovery scenarios, since a single failed reduce, has a significant, adverse, impact on the latency of the job.

Too many reduces adversely affects the shuffle crossbar. Also, in extreme cases it results in too many small files created as the output of the job — this hurts both the NameNode and performance of subsequent Map-Reduce applications who need to process lots of small files.
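To put a number on "too many or too few": the official Hadoop MapReduce tutorial suggests sizing the reducer count at roughly 0.95 or 1.75 times (number of nodes × maximum reduce containers per node). A small sketch of that arithmetic follows; the cluster figures are made up for illustration.

public class ReducerHeuristic {
    public static void main(String[] args) {
        // Hypothetical cluster: 10 worker nodes, each able to run
        // 8 reduce containers at once.
        int nodes = 10;
        int maxContainersPerNode = 8;

        // 0.95 * capacity: all reducers launch immediately and the
        // job finishes in a single reduce "wave".
        int singleWave = (int) (0.95 * nodes * maxContainersPerNode); // 76

        // 1.75 * capacity: faster nodes run a second wave of reducers,
        // which improves load balancing at the cost of extra startup
        // and shuffle overhead.
        int doubleWave = (int) (1.75 * nodes * maxContainersPerNode); // 140

        System.out.println("Single wave: " + singleWave);
        System.out.println("Double wave: " + doubleWave);
    }
}

Either way, the count should be driven by cluster capacity and data volume, not set as high as possible.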

Regarding "hadoop - Why does increasing the number of reducers increase the time spent in the reduce phase?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39541718/
