
hadoop - Understanding the reduce-side merge in Hadoop


I have trouble understanding the reduce-side file merge process in Hadoop as it is described in "Hadoop: The Definitive Guide" (Tom White). Quoting the book:

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files. Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.

The number of files merged in each round is actually more subtle than this example suggests. The goal is to merge the minimum number of files to get to the merge factor for the final round. So if there were 40 files, the merge would not merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only 4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files and the 6 (as yet unmerged) files make a total of 10 files for the final round. The process is illustrated in Figure 6-7. Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is written to disk, since the final round always merges directly into the reduce.
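As a minimal sketch (not Hadoop source code), the rule quoted above can be written down as: if the segments already fit within the merge factor, merge them all in the final round; otherwise pick the smallest first-round size that makes the later full rounds land on exactly the merge factor. The class and method names below (MergePlan, firstRoundSize) are my own, chosen for illustration:

    public class MergePlan {
        /** Number of segments to merge in the first intermediate round. */
        static int firstRoundSize(int numSegments, int mergeFactor) {
            if (numSegments <= mergeFactor) {
                return numSegments;              // everything fits in the final round
            }
            int mod = (numSegments - 1) % (mergeFactor - 1);
            return (mod == 0) ? mergeFactor : mod + 1;
        }

        public static void main(String[] args) {
            // The book's 40-file example with io.sort.factor = 10:
            // the first round merges 4 files, the next three merge 10 each,
            // leaving 4 merged + 6 unmerged = 10 segments for the final round.
            System.out.println(firstRoundSize(40, 10));   // prints 4
        }
    }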

In the second example (with 40 files) we really do reach the merge factor for the final round: the fifth round has 10 files that are not written to disk and go straight to reduce. But in the first example there are actually 6 rounds, not 5. In each of the first five rounds, 10 files are merged and written to disk, and then in the 6th round we have 5 files (not 10!) going straight to reduce. Why? If we stick to "The goal is to merge the minimum number of files to get to the merge factor for the final round", then for these 50 files we would have to merge 5 files in the first round, then 10 files in each of the following 4 rounds, and then we would have a merge factor of 10 in the final, 6th round.

Note that we cannot merge more than 10 files in any round (as specified by io.sort.factor in both examples).

What am I getting wrong about the first example, the one that merges 50 files?

Best Answer

This is how I understand it. If you read carefully, the important thing to keep in mind is:

Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is written to disk, since the final round always merges directly into the reduce.

With or without the optimization, the number of merge rounds stays the same (5 in the first case and 4 in the second).

  • First case: 50 files are merged down to the last 5, which are then fed directly into the "reduce" phase (total rounds = 5 + 1 = 6)
  • Second case: 34 files are merged down to the last 4, and the remaining 6 are read directly from memory and fed into the "reduce" phase (total rounds = 4 + 1 = 5); the round arithmetic is sketched right after this list
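As a small illustration of the round arithmetic in the two bullets above (merge factor = 10); the per-round file counts are taken from the answer itself, not measured from Hadoop:

    public class RoundCount {
        public static void main(String[] args) {
            // Files merged per intermediate round, as described in the bullets above.
            int[] case1 = {10, 10, 10, 10, 10}; // 50 files merged down to 5 intermediate files
            int[] case2 = {4, 10, 10, 10};      // 34 of 40 files merged down to 4; 6 stay in memory
            // The final merge that feeds the reduce phase adds one more round in each case.
            System.out.println(case1.length + 1); // 6
            System.out.println(case2.length + 1); // 5
        }
    }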

In both cases the number of merge rounds is determined by the configuration mapreduce.task.io.sort.factor, which is set to 10.
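For reference, a minimal driver sketch showing where this property would be set; the job name and the rest of the driver are placeholders, and 10 is already the default value:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortFactorExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Merge factor used on both the map and reduce side
            // (io.sort.factor is the older, deprecated name for the same setting).
            conf.setInt("mapreduce.task.io.sort.factor", 10);
            Job job = Job.getInstance(conf, "merge-factor-demo");
            // ... set mapper, reducer, input/output paths as usual ...
        }
    }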

So the number of merge rounds does not change (whether or not the optimization is applied). However, the number of files merged in each round may change, because the Hadoop framework can introduce some optimizations to reduce the number of merges and hence the amount of data spilled to disk.

So in the first case, without the optimization, the contents of all 50 files (merged down to the last 5 files) are spilled to disk, and those files are read back from disk during the "reduce" phase.

In the second case, with the optimization, the contents of 34 files (merged down to the last 4 files) are spilled to disk and read back from disk, while the remaining 6 unmerged files are read directly from the in-memory buffer during the "reduce" phase.

The idea of the optimization is to minimize merging and spilling.

Regarding hadoop - Understanding the reduce-side merge in Hadoop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26015678/
