
hadoop - s3distcp hangs after showing 100%

Reposted. Author: 可可西里. Updated: 2023-11-01 14:24:01

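In an attempt to resolve performance issues with Amazon EMR, I am trying to use s3distcp to copy files from S3 to my EMR cluster for local processing. As a first test, I am copying one day's data (2160 files) from a single directory, using the --groupBy option to collapse them into one (or a few) files.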
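For readers unfamiliar with --groupBy: as the S3DistCp documentation describes it, files whose names match the regular expression are concatenated, and the text captured by the parentheses decides which output file each input lands in. The sketch below only illustrates that capture-group idea locally with sed; the three object keys are hypothetical, and S3DistCp itself uses Java regular expressions, so this is an approximation rather than its actual matching code.

```shell
#!/bin/sh
# The same pattern that is passed to --groupBy in the question.
pattern='.*(20140520).*'

# Print the text captured by the first group, or "(no match)".
# Assumption: the pattern contains no "/" (it is used inside a sed s/// command).
group_key() {
  key=$(printf '%s\n' "$1" | sed -nE "s/^${pattern}\$/\1/p")
  if [ -n "$key" ]; then
    printf '%s\n' "$key"
  else
    printf '%s\n' "(no match)"
  fi
}

# Hypothetical object keys, for illustration only.
for name in click/20140520/part-0001.gz click/20140520/part-0002.gz click/20140521/part-0001.gz; do
  printf '%s -> %s\n' "$name" "$(group_key "$name")"
done
```

Keys that yield the same captured value would be merged into one output file; keys that do not match the pattern are not part of that group.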

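The job seems to run just fine, showing map/reduce progress all the way to 100%, but at that point the process hangs and never returns. How can I figure out what is going on?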

The source files are GZipped text files stored in S3, each about 30 KB. This is a vanilla Amazon EMR cluster, and I am running s3distcp from the shell of the master node.

hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520 --dest hdfs:////data/click/20140520 --groupBy ".*(20140520).*" --outputCodec lzo
14/05/21 20:06:32 INFO s3distcp.S3DistCp: Running with args: [Ljava.lang.String;@26f3bbad
14/05/21 20:06:35 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/output'
14/05/21 20:06:35 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-west-2b
14/05/21 20:06:35 INFO s3distcp.S3DistCp: Created AmazonS3Client with conf KeyId AKIAJ5KT6QSV666K6KHA
14/05/21 20:06:37 INFO s3distcp.FileInfoListing: Opening new file: hdfs:/tmp/9f423c59-ec3a-465e-8632-ae449d45411a/files/1
14/05/21 20:06:38 INFO s3distcp.S3DistCp: Created 1 files to copy 2160 files
14/05/21 20:06:38 INFO mapred.JobClient: Default number of map tasks: null
14/05/21 20:06:38 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 72
14/05/21 20:06:38 INFO mapred.JobClient: Default number of reduce tasks: 3
14/05/21 20:06:39 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/05/21 20:06:39 INFO mapred.JobClient: Setting group to hadoop
14/05/21 20:06:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/21 20:06:39 INFO mapred.JobClient: Running job: job_201405211343_0031
14/05/21 20:06:40 INFO mapred.JobClient: map 0% reduce 0%
14/05/21 20:06:53 INFO mapred.JobClient: map 1% reduce 0%
14/05/21 20:06:56 INFO mapred.JobClient: map 4% reduce 0%
14/05/21 20:06:59 INFO mapred.JobClient: map 36% reduce 0%
14/05/21 20:07:00 INFO mapred.JobClient: map 44% reduce 0%
14/05/21 20:07:02 INFO mapred.JobClient: map 54% reduce 0%
14/05/21 20:07:05 INFO mapred.JobClient: map 86% reduce 0%
14/05/21 20:07:06 INFO mapred.JobClient: map 94% reduce 0%
14/05/21 20:07:08 INFO mapred.JobClient: map 100% reduce 10%
14/05/21 20:07:11 INFO mapred.JobClient: map 100% reduce 19%
14/05/21 20:07:14 INFO mapred.JobClient: map 100% reduce 27%
14/05/21 20:07:17 INFO mapred.JobClient: map 100% reduce 29%
14/05/21 20:07:20 INFO mapred.JobClient: map 100% reduce 100%
[hangs here]

The job shows up as:

hadoop@xxx:~$ hadoop job -list
1 job currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201405211343_0031 1 1400702799339 hadoop NORMAL NA

There is nothing in the destination HDFS directory:

hadoop@xxx:~$ hadoop dfs -ls /data/click/

Any ideas?

Best answer

hadoop@ip-xxx:~$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3n://xxx/click/20140520/ --dest hdfs:////data/click/20140520/ --groupBy ".*(20140520).*" --outputCodec lzo

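I had a similar problem. All I needed was to add a trailing slash to the end of the source and destination directories. With that change the job completed and printed its statistics, whereas before it was stuck at 100%.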
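If the command is assembled in a script, the trailing slash can be added defensively. This is a minimal sketch, assuming the same jar path, bucket placeholder, and options as in the question; the ensure_trailing_slash helper is a hypothetical name introduced here, not part of s3distcp.

```shell
#!/bin/sh
# Append a trailing slash to a directory URI unless it already ends in one.
ensure_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

SRC="$(ensure_trailing_slash "s3n://xxx/click/20140520")"
DEST="$(ensure_trailing_slash "hdfs:////data/click/20140520")"

# Print the command instead of running it, so it can be reviewed first.
echo hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src "$SRC" --dest "$DEST" \
  --groupBy "'.*(20140520).*'" --outputCodec lzo
```

Paths that already end in a slash pass through unchanged, so the helper is safe to apply to every directory argument.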

Regarding "hadoop - s3distcp hangs after showing 100%", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23793026/
