
hadoop - Configure Hadoop to process an input file as a single map task


I am running MapReduce on a 200MB file. My goal is to end up with exactly 1 map task. I did:

Configuration conf = new Configuration();
conf.set("mapred.min.split.size","999999999999999");

However, it seems the number of records is what limits me. Is that why the input gets split into multiple map tasks? If so, what can I do to change it?

14/03/20 00:12:04 INFO mapred.MapTask: data buffer = 79691776/99614720
14/03/20 00:12:04 INFO mapred.MapTask: record buffer = 262144/327680
14/03/20 00:12:05 INFO mapred.MapTask: Spilling map output: record full = true

Best Answer

mapred.min.split.size normally forms the lower bound when input splits are created, while the DFS block size, 128MB, is the upper bound. So in your case the lower bound is larger than the upper bound; Hadoop does not appear to reconcile the two, and instead takes the upper bound and splits the input data accordingly.
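For reference, here is a simplified sketch (not the literal Hadoop source) of how the old mapred-API FileInputFormat combines these bounds into a split size; goalSize is the total input size divided by the mapred.map.tasks hint, and the concrete numbers below are just the question's 200MB file with default settings:

public class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        // lower bound: minSize; upper bound: the smaller of goalSize and blockSize
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // DFS block size (upper bound)
        long totalSize = 200L * 1024 * 1024;   // the question's 200MB file
        long goalSize  = totalSize / 2;        // totalSize / mapred.map.tasks hint (default 2)
        long minSize   = 1;                    // default mapred.min.split.size
        System.out.println(computeSplitSize(goalSize, minSize, blockSize)); // 100MB splits
    }
}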

Quoted from the Hadoop wiki:

Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
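(The 82k figure in the quote is just 10 TB divided into 128MB splits: 10 × 1024 × 1024 MB / 128 MB = 81,920 ≈ 82k splits, hence roughly 82k map tasks.)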

The hint for you is in the last sentence: if you want to control the number of mappers, you have to override the InputFormat. Usually that is FileInputFormat (in practice a concrete subclass such as TextInputFormat), whose isSplitable() method has to be overridden to return false. That guarantees one mapper per file. Something like the following is enough:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Extends TextInputFormat (a concrete FileInputFormat) so the class compiles as-is;
// returning false tells Hadoop never to split the input file.
public class NonSplittableFileInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}
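A minimal sketch of how this could be wired into a job driver, assuming the old mapred API; the class name SingleMapDriver and the command-line paths are placeholders, and the default identity mapper/reducer is used so the job simply copies the input while handing the whole file to one map task:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleMapDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SingleMapDriver.class);
        job.setJobName("single-map-job");
        // Register the non-splittable format so the whole file becomes one split,
        // and therefore one map task.
        job.setInputFormat(NonSplittableFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the 200MB file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
    }
}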

Regarding "hadoop - Configure Hadoop to process an input file as a single map task", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22512037/
