
performance - Hadoop performance


I installed Hadoop 1.0.0 and tried the word count example on a single-node cluster. It took 2 minutes 48 seconds to complete. Then I ran the standard Linux word count program, which finished in 10 ms on the same data set (180 kB). Am I doing something wrong, or is Hadoop just very, very slow?

time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput
12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
12/01/29 23:04:41 INFO mapred.JobClient: Running job: job_201201292302_0001
12/01/29 23:04:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/29 23:05:05 INFO mapred.JobClient: map 6% reduce 0%
12/01/29 23:05:15 INFO mapred.JobClient: map 13% reduce 0%
12/01/29 23:05:25 INFO mapred.JobClient: map 16% reduce 0%
12/01/29 23:05:27 INFO mapred.JobClient: map 20% reduce 0%
12/01/29 23:05:28 INFO mapred.JobClient: map 20% reduce 4%
12/01/29 23:05:34 INFO mapred.JobClient: map 20% reduce 5%
12/01/29 23:05:35 INFO mapred.JobClient: map 23% reduce 5%
12/01/29 23:05:36 INFO mapred.JobClient: map 26% reduce 5%
12/01/29 23:05:41 INFO mapred.JobClient: map 26% reduce 8%
12/01/29 23:05:44 INFO mapred.JobClient: map 33% reduce 8%
12/01/29 23:05:53 INFO mapred.JobClient: map 36% reduce 11%
12/01/29 23:05:54 INFO mapred.JobClient: map 40% reduce 11%
12/01/29 23:05:56 INFO mapred.JobClient: map 40% reduce 12%
12/01/29 23:06:01 INFO mapred.JobClient: map 43% reduce 12%
12/01/29 23:06:02 INFO mapred.JobClient: map 46% reduce 12%
12/01/29 23:06:06 INFO mapred.JobClient: map 46% reduce 14%
12/01/29 23:06:09 INFO mapred.JobClient: map 46% reduce 15%
12/01/29 23:06:11 INFO mapred.JobClient: map 50% reduce 15%
12/01/29 23:06:12 INFO mapred.JobClient: map 53% reduce 15%
12/01/29 23:06:20 INFO mapred.JobClient: map 56% reduce 15%
12/01/29 23:06:21 INFO mapred.JobClient: map 60% reduce 17%
12/01/29 23:06:28 INFO mapred.JobClient: map 63% reduce 17%
12/01/29 23:06:29 INFO mapred.JobClient: map 66% reduce 17%
12/01/29 23:06:30 INFO mapred.JobClient: map 66% reduce 20%
12/01/29 23:06:36 INFO mapred.JobClient: map 70% reduce 22%
12/01/29 23:06:37 INFO mapred.JobClient: map 73% reduce 22%
12/01/29 23:06:45 INFO mapred.JobClient: map 80% reduce 24%
12/01/29 23:06:51 INFO mapred.JobClient: map 80% reduce 25%
12/01/29 23:06:54 INFO mapred.JobClient: map 86% reduce 25%
12/01/29 23:06:55 INFO mapred.JobClient: map 86% reduce 26%
12/01/29 23:07:02 INFO mapred.JobClient: map 90% reduce 26%
12/01/29 23:07:03 INFO mapred.JobClient: map 93% reduce 26%
12/01/29 23:07:07 INFO mapred.JobClient: map 93% reduce 30%
12/01/29 23:07:09 INFO mapred.JobClient: map 96% reduce 30%
12/01/29 23:07:10 INFO mapred.JobClient: map 96% reduce 31%
12/01/29 23:07:12 INFO mapred.JobClient: map 100% reduce 31%
12/01/29 23:07:22 INFO mapred.JobClient: map 100% reduce 100%
12/01/29 23:07:28 INFO mapred.JobClient: Job complete: job_201201292302_0001
12/01/29 23:07:28 INFO mapred.JobClient: Counters: 29
12/01/29 23:07:28 INFO mapred.JobClient: Job Counters
12/01/29 23:07:28 INFO mapred.JobClient: Launched reduce tasks=1
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=275346
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Launched map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: Data-local map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=137186
12/01/29 23:07:28 INFO mapred.JobClient: File Output Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Written=26287
12/01/29 23:07:28 INFO mapred.JobClient: FileSystemCounters
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_READ=71510
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_READ=89916
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=956282
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26287
12/01/29 23:07:28 INFO mapred.JobClient: File Input Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Read=85860
12/01/29 23:07:28 INFO mapred.JobClient: Map-Reduce Framework
12/01/29 23:07:28 INFO mapred.JobClient: Map output materialized bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Map input records=2574
12/01/29 23:07:28 INFO mapred.JobClient: Reduce shuffle bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Spilled Records=6696
12/01/29 23:07:28 INFO mapred.JobClient: Map output bytes=118288
12/01/29 23:07:28 INFO mapred.JobClient: CPU time spent (ms)=39330
12/01/29 23:07:28 INFO mapred.JobClient: Total committed heap usage (bytes)=5029167104
12/01/29 23:07:28 INFO mapred.JobClient: Combine input records=8233
12/01/29 23:07:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=4056
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input groups=1265
12/01/29 23:07:28 INFO mapred.JobClient: Combine output records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=4936278016
12/01/29 23:07:28 INFO mapred.JobClient: Reduce output records=1265
12/01/29 23:07:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=26102546432
12/01/29 23:07:28 INFO mapred.JobClient: Map output records=8233

real 2m48.886s
user 0m3.300s
sys 0m0.304s


time wc someinput/*
178 1001 8674 someinput/capacity-scheduler.xml
178 1001 8674 someinput/capacity-scheduler.xml.bak
7 7 196 someinput/commons-logging.properties
7 7 196 someinput/commons-logging.properties.bak
24 35 535 someinput/configuration.xsl
80 122 1968 someinput/core-site.xml
80 122 1972 someinput/core-site.xml.bak
1 0 1 someinput/dfs.exclude
1 0 1 someinput/dfs.include
12 36 327 someinput/fair-scheduler.xml
45 192 2141 someinput/hadoop-env.sh
45 192 2139 someinput/hadoop-env.sh.bak
20 137 910 someinput/hadoop-metrics2.properties
20 137 910 someinput/hadoop-metrics2.properties.bak
118 582 4653 someinput/hadoop-policy.xml
118 582 4653 someinput/hadoop-policy.xml.bak
241 623 6616 someinput/hdfs-site.xml
241 623 6630 someinput/hdfs-site.xml.bak
171 417 6177 someinput/log4j.properties
171 417 6177 someinput/log4j.properties.bak
1 0 1 someinput/mapred.exclude
1 0 1 someinput/mapred.include
12 15 298 someinput/mapred-queue-acls.xml
12 15 298 someinput/mapred-queue-acls.xml.bak
338 897 9616 someinput/mapred-site.xml
338 897 9630 someinput/mapred-site.xml.bak
1 1 10 someinput/masters
1 1 18 someinput/slaves
57 89 1243 someinput/ssl-client.xml.example
55 85 1195 someinput/ssl-server.xml.example
2574 8233 85860 total

real 0m0.009s
user 0m0.004s
sys 0m0.000s
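Both runs agree on the data itself: wc's 8233-word total matches the job counter "Map output records=8233", and the 1265 "Reduce output records" are the distinct words. The counting logic Hadoop executes is simple; what follows is a minimal Python sketch of WordCount's map/combine/reduce phases on a toy input (not the actual config files above), just to show what each counter corresponds to:

```python
from collections import Counter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input
    return [(word, 1) for line in lines for word in line.split()]

def combine_phase(pairs):
    # combine: pre-aggregate counts within a map task before the shuffle
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_phase(pairs):
    # reduce: sum the combined counts for each distinct word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = map_phase(lines)         # 9 pairs  -> "Map output records"
combined = combine_phase(mapped)  # 6 words  -> "Combine output records"
result = reduce_phase(combined)   # final counts -> "Reduce output records"
print(result["the"])  # 3
```

The expensive part in the real job is not this logic but everything wrapped around it: 30 input files produce 30 map tasks (Hadoop 1.x FileInputFormat creates at least one split per file), each running in its own JVM and reading from HDFS.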

Best answer

It depends on many factors, including your configuration, your machine, the memory configuration, JVM settings, etc. You also need to subtract the JVM startup time.

It runs faster for me. That said, on a small data set it will of course be slower than a dedicated C program; consider everything it does "under the hood".
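Per-task overhead dominates the wall time here, not the counting. A rough back-of-envelope model, using the task counts from the job output; the per-task and scheduling figures below are assumptions for illustration, not measured values:

```python
# Rough overhead model for the job above (assumed figures marked ASSUMED).
map_tasks = 30            # "Launched map tasks=30" from the job counters
reduce_tasks = 1          # "Launched reduce tasks=1"
per_task_overhead_s = 5   # ASSUMED: JVM spawn + scheduling per task (no JVM reuse)
scheduling_slack_s = 10   # ASSUMED: heartbeat/polling latency at job start and end

overhead_s = (map_tasks + reduce_tasks) * per_task_overhead_s + scheduling_slack_s
print(overhead_s)  # 165
```

Under these assumptions the estimated overhead (165 s) is already in the neighborhood of the observed 168.9 s wall time, which is why the 86 kB of actual counting work barely registers.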

Try it on terabytes of data spread across thousands of files and see what happens.

This question about Hadoop performance comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/9057348/
