java - Setting the Hadoop number of mappers with input splits does not work

Reposted. Author: 行者123  Updated: 2023-12-02 21:09:38

I am trying to run a Hadoop job several times with different numbers of mappers and reducers. I have set the following configuration properties:

  • mapreduce.input.fileinputformat.split.maxsize
  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.job.maps


My input is 1160421275 bytes in total, and I try to configure the job with 4 mappers and 3 reducers in this code:
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
long size = hdfs.getContentSummary(new Path("input/filea")).getLength();
size += hdfs.getContentSummary(new Path("input/fileb")).getLength();
conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(size / 4));
conf.set("mapreduce.input.fileinputformat.split.minsize", String.valueOf(size / 4));
conf.setInt("mapreduce.job.maps", 4);
....
job.setNumReduceTasks(3);

size / 4 is 290105318. Running the job produces this output:
2016-11-19 12:30:36,426 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1
2016-11-19 12:30:36,535 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 4
2016-11-19 12:30:36,572 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:7

The number of splits is 7, not 4, and the output of the successful job is:
File System Counters
FILE: Number of bytes read=18855390277
FILE: Number of bytes written=14653469965
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=39184416
Map output records=36751473
Map output bytes=787022241
Map output materialized bytes=860525313
Input split bytes=1801
Combine input records=0
Combine output records=0
Reduce input groups=25064998
Reduce shuffle bytes=860525313
Reduce input records=36751473
Reduce output records=1953960
Spilled Records=110254419
Shuffled Maps =21
Failed Shuffles=0
Merged Map outputs=21
GC time elapsed (ms)=1124
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=6126829568
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=77643084

The counters show 21 shuffled maps being processed, but I expected the job to use only 4 mappers. For the reducers it correctly produced a total of 3 output files. Is my mapper split-size configuration wrong?

Best Answer

I believe you are using TextInputFormat.

  • If you have multiple files, each file produces at least one mapper. If a single file's size (not the cumulative size, but the individual file size) is larger than the split size (which you have adjusted by setting the min and max), additional mappers are spawned for that file.
  • Try CombineTextInputFormat; it will get you closer to what you want, though probably still not exactly 4.
  • Look at the logic of the InputFormat you are using to see how it decides how many mappers to spawn.
  • Regarding "java - Setting the Hadoop number of mappers with input splits does not work", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40689601/
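The per-file behavior in the first point can be sketched in plain Java. The two file sizes below are hypothetical (the question only gives the combined size of input/filea and input/fileb), and the roughly 10% slop that FileInputFormat tolerates on a file's last split is ignored, so this is an approximation of the real split counting, not the exact Hadoop code:

```java
// Sketch of FileInputFormat-style per-file split counting.
// File sizes here are hypothetical; only their sum (1160421275)
// comes from the question. The ~10% slop on the last split of
// each file that real FileInputFormat allows is ignored.
public class SplitCountSketch {
    static int countSplits(long[] fileSizes, long splitSize) {
        int splits = 0;
        for (long size : fileSizes) {
            // Each file is divided independently, so the remainder
            // of every file costs one extra split on its own.
            splits += (int) Math.ceil((double) size / splitSize);
        }
        return splits;
    }

    public static void main(String[] args) {
        long total = 1_160_421_275L;      // combined input size from the question
        long splitSize = total / 4;       // 290105318, the min/max split size set
        // Hypothetical breakdown of the total into two files:
        long[] fileSizes = {820_000_000L, 340_421_275L};
        System.out.println(countSplits(fileSizes, splitSize)); // prints 5, not 4
    }
}
```

Because every file is split on its own, the leftover bytes of each file cost an extra split, which is how a requested 4 can turn into 5, 6, or 7 depending on how the total is distributed across files. CombineTextInputFormat avoids this by packing data from several files into one split.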
