performance - Hadoop YARN performance: running the WordCount example on a cluster is very slow

Reposted. Author: 行者123. Updated: 2023-12-02 21:51:13

I recently set up a Hadoop cluster for testing. It has two nodes for running tasks and is based on YARN.

I know Hadoop is not meant for small examples and only shows good performance at very large data scales, but this is still too slow. I mean extremely slow. My input file is a document of 500,000 words, and the number of reducers is 2.

Here is the log:

 hadoop jar /home/hadoop/hadoopTest.jar  com.hadoop.WordCountJob /wordcountest /wordcountresult

Job started: Mon Dec 23 12:38:13 CST 2013
13/12/23 12:38:13 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/12/23 12:38:14 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/12/23 12:38:14 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/23 12:38:27 INFO input.FileInputFormat: Total input paths to process : 1
13/12/23 12:38:27 INFO mapreduce.JobSubmitter: number of splits:1
13/12/23 12:38:27 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/12/23 12:38:27 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/12/23 12:38:27 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/12/23 12:38:27 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
13/12/23 12:38:27 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/12/23 12:38:27 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
13/12/23 12:38:27 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
13/12/23 12:38:27 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/12/23 12:38:27 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/12/23 12:38:27 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
13/12/23 12:38:27 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/23 12:38:27 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/12/23 12:38:27 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/12/23 12:38:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1383617275312_0021
13/12/23 12:38:30 INFO client.YarnClientImpl: Submitted application application_1383617275312_0021 to ResourceManager at Hadoop1/111.11.11.11:8032
13/12/23 12:38:30 INFO mapreduce.Job: The url to track the job: http://kmHadoop1:8088/proxy/application_1383617275312_0021/
13/12/23 12:38:30 INFO mapreduce.Job: Running job: job_1383617275312_0021
13/12/23 12:43:22 INFO mapreduce.Job: Job job_1383617275312_0021 running in uber mode : false
13/12/23 12:43:22 INFO mapreduce.Job: map 0% reduce 0%
13/12/23 13:03:37 INFO mapreduce.Job: map 67% reduce 0%
13/12/23 13:03:43 INFO mapreduce.Job: map 100% reduce 0%
13/12/23 13:07:04 INFO mapreduce.Job: map 100% reduce 37%
13/12/23 13:07:07 INFO mapreduce.Job: map 100% reduce 51%
13/12/23 13:07:10 INFO mapreduce.Job: map 100% reduce 67%
13/12/23 13:07:51 INFO mapreduce.Job: map 100% reduce 69%
13/12/23 13:07:52 INFO mapreduce.Job: map 100% reduce 70%
13/12/23 13:07:54 INFO mapreduce.Job: map 100% reduce 85%
13/12/23 13:07:54 INFO mapreduce.Job: map 100% reduce 100%
13/12/23 13:07:54 INFO mapreduce.Job: Job job_1383617275312_0021 completed successfully
13/12/23 13:07:55 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=519233
FILE: Number of bytes written=1254635
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2356520
HDFS: Number of bytes written=427594
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=1225928
Total time spent by all reduces in occupied slots (ms)=495508
Map-Reduce Framework
Map input records=8646
Map output records=420146
Map output bytes=4187027
Map output materialized bytes=519225
Input split bytes=122
Combine input records=0
Combine output records=0
Reduce input groups=35430
Reduce shuffle bytes=519225
Reduce input records=420146
Reduce output records=35430
Spilled Records=840292
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=263996
CPU time spent (ms)=222750
Physical memory (bytes) snapshot=529215488
Virtual memory (bytes) snapshot=4047876096
Total committed heap usage (bytes)=479268864
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2356398
File Output Format Counters
Bytes Written=427594
Job ended: Mon Dec 23 13:07:55 CST 2013
The job took 1782 seconds.

You can see a timestamp at the start of each log line.

Every step seems slow: initialization, checking the input paths, starting up on YARN, the MapReduce phases, and so on.

The whole job took 1782 seconds.
What happened? Did I do something wrong?

My Hadoop version is CDH4.3.0, and the cluster has 2 nodes. There are thousands of small files in HDFS; could that be a problem?

Best Answer

From your output I can see:

Map output bytes=4187027
Map output materialized bytes=519225
...

that you are compressing (at least) the intermediate map output. You could try rerunning the example with compression turned off; GZIP compression is known to put a heavy load on the CPU. Before turning compression off, you might want to monitor CPU load to verify that it really is the bottleneck.
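The deduction above follows from the two counters quoted: the map output shrinks from 4,187,027 bytes to 519,225 materialized bytes, which only a compression codec would explain. A quick check of the implied ratio, using plain arithmetic on the counter values from the log:

```shell
# Compression ratio implied by the job counters:
#   Map output bytes              = 4187027
#   Map output materialized bytes = 519225
awk 'BEGIN { printf "ratio = %.2f : 1\n", 4187027 / 519225 }'
```

A roughly 8:1 shrink is consistent with GZIP on highly repetitive word-count output.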

I have seen excessively long job times when running GZIP compression on 2- or 3-node clusters. That changes as you start adding nodes. When I scaled such a cluster up to 10 nodes and reran the same job, compression actually became quite beneficial (about a 40% improvement in overall job time for a 100 GB Terasort, relative to running without compression).
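The suggested rerun with compression disabled might look like the sketch below. This is a hedged example: `mapreduce.map.output.compress` is the Hadoop 2.x property name (older releases use the deprecated `mapred.compress.map.output`), and passing `-D` on the command line only takes effect if the driver goes through `GenericOptionsParser`/`Tool`; the warning in the log above indicates this job does not, so you may need to set the property in `mapred-site.xml` or in the driver's `Configuration` object instead.

```shell
# Hypothetical rerun with intermediate (map-output) compression disabled.
# The -D generic option requires the driver to implement the Tool interface.
hadoop jar /home/hadoop/hadoopTest.jar com.hadoop.WordCountJob \
  -D mapreduce.map.output.compress=false \
  /wordcountest /wordcountresult
```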

Regarding "performance - Hadoop YARN performance: running the WordCount example on a cluster is very slow", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20737620/
