
hadoop - Spark: What is the ideal number of reducers


My data is about 300 GB. If I run the reduce job on it with Hadoop, 180 reduce slots are fine and no tasks wait in the queue.

If I run the same job with Spark using the same number of reduce slots, it gets stuck in the shuffle stage. This does not happen if I use many more slots (say 4000), but that ends up being inefficient.

Is there anything I can do, such as tuning parameters, so that I can use the same number of slots as in Hadoop?

By the way, my cluster has 15 nodes with 12 cores each.
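For reference, a minimal sketch of how the Spark side is set up (the input path and the key extraction below are placeholders, not the actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the Spark side of the job; the path and the key
// extraction are placeholders, not the real pipeline.
val conf = new SparkConf()
  .setAppName("reduce-slots")
  // Default partition count for shuffles when none is given explicitly;
  // 180 mirrors the Hadoop reduce-slot count (15 nodes * 12 cores).
  .set("spark.default.parallelism", "180")
val sc = new SparkContext(conf)

sc.textFile("hdfs:///input/data")            // ~300 GB of input
  .map(line => (line.split("\t")(0), 1L))    // placeholder key extraction
  .reduceByKey(_ + _, 180)                   // stuck at 180; fine at 4000
  .saveAsTextFile("hdfs:///output/counts")
```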

Best Answer

Shuffle Operation in Hadoop and Spark is a good read on this topic. Some quotes:

Each map task in Spark writes out a shuffle file (operating system disk buffer) for every reducer – this corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since scheduling overhead in Spark is much lesser, the no. of mappers (M) and reducers (R) is far higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overheads.

A major difference between Hadoop and Spark is on the reducer side – Spark requires all shuffled data to fit into memory of the corresponding reducer task (we saw that Hadoop had an option to spill this over to disk).
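This requirement is likely what the question is hitting: at 180 reducers each task has to hold a far larger share of the 300 GB than at 4000. A rough calculation, assuming a uniform key distribution (which real data rarely has):

```scala
// Rough per-reducer shuffle volume for 300 GB of map output,
// assuming uniform key distribution (illustrative only).
val totalGb = 300.0
println(f"per reducer at 180:  ${totalGb / 180 * 1024}%.0f MB")   // ~1700 MB
println(f"per reducer at 4000: ${totalGb / 4000 * 1024}%.0f MB")  // ~77 MB
```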

It does look like Hadoop shuffle is much more optimized compared to Spark’s shuffle from the discussion so far. However, this was the case earlier, and researchers have since made significant optimizations to Spark w.r.t. the shuffle operation. The possible approaches are: 1. to emulate Hadoop behavior by merging intermediate files, 2. to create larger shuffle files, 3. to use columnar compression to shift the bottleneck to the CPU.
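To put numbers on the M*R term: with ~300 GB of input and 128 MB splits (the split size is my assumption, not taken from the question), Spark launches roughly 2400 map tasks, so the hash-shuffle file count grows quickly with the reducer count. A back-of-the-envelope sketch:

```scala
// Back-of-the-envelope shuffle-file counts for hash-based shuffle,
// assuming ~300 GB of input and 128 MB splits (both illustrative).
val mappers = 300L * 1024 / 128                        // ~2400 map tasks (M)
println(s"M*R at 180 reducers:  ${mappers * 180}")     // ~432,000 files
println(s"M*R at 4000 reducers: ${mappers * 4000}")    // ~9,600,000 files
```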

Optimizing Shuffle Performance in Spark reaches a similar conclusion:

By identifying the shuffle phase bottlenecks specific to Spark, we have explored several alternatives to mitigate the operating system overheads associated with these bottlenecks. The most fruitful of which is shuffle file consolidation, a simple solution that led to a 2x improvement in overall job completion time.
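The shuffle file consolidation mentioned here surfaced in older Spark releases as a configuration flag. A minimal sketch, assuming a Spark 1.x deployment where the hash-based shuffle manager and its consolidation flag still exist (both were later removed):

```scala
import org.apache.spark.SparkConf

// Spark 1.x-era settings; both were removed once sort-based shuffle
// became the default, so this is historical illustration only.
val conf = new SparkConf()
  .setAppName("shuffle-consolidation")
  // Hash-based shuffle writes one file per map task per reducer (M*R files).
  .set("spark.shuffle.manager", "hash")
  // Consolidation reuses files across map tasks on the same core, so the
  // file count scales with cores*R instead of M*R.
  .set("spark.shuffle.consolidateFiles", "true")
```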

So as you can see, Hadoop/YARN is not directly comparable to Spark, especially when it comes to shuffle and reduce. Unlike Hadoop, Spark calls for its own specific optimization techniques. Exactly what your case needs is hard to guess, but my impression is that you are only scratching the surface of the problem, and simply tweaking the number of reducers in Spark will not solve it.
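That said, if you want to experiment beyond the reducer count, the shuffle-related knobs below are the usual starting points. A hedged sketch, assuming a Spark 1.x-era deployment (the values shown are the defaults of that era, not recommendations):

```scala
import org.apache.spark.SparkConf

// Common shuffle-related knobs in Spark 1.x; values shown are the
// defaults, listed only as a starting point for experimentation.
val conf = new SparkConf()
  // How much map output each reducer fetches at once.
  .set("spark.reducer.maxSizeInFlight", "48m")
  // In-memory buffer per shuffle file writer before hitting disk.
  .set("spark.shuffle.file.buffer", "32k")
  // Heap fraction for shuffle aggregation (pre-1.6 memory model).
  .set("spark.shuffle.memoryFraction", "0.2")
```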

Regarding hadoop - Spark: What is the ideal number of reducers, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39118499/
