
python - Unable to run a Python-based Hadoop Streaming job

Reposted. Author: 可可西里. Updated: 2023-11-01 16:14:10

I have a 5-node Hadoop cluster on which I can successfully run the following streaming job:

 sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo1 -mapper 'wc -l' -numReduceTasks 0

But when I try to run a streaming job that uses Python:

sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo5 -mapper 'AttributeMax.py 8' -file '/tmp/AttributeMax.py' -numReduceTasks 1

I get an error:

packageJobJar: [/tmp/AttributeMax.py, /tmp/hadoop-hdfs/hadoop-unjar2062240123197790813/] [] /tmp/streamjob4074525553604040275.jar tmpDir=null
14/08/29 11:22:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/29 11:22:58 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/29 11:22:59 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local]
14/08/29 11:22:59 INFO streaming.StreamJob: Running job: job_201408272304_0030
14/08/29 11:22:59 INFO streaming.StreamJob: To kill this job, run:
14/08/29 11:22:59 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030
14/08/29 11:22:59 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030
14/08/29 11:23:00 INFO streaming.StreamJob: map 0% reduce 0%
14/08/29 11:23:46 INFO streaming.StreamJob: map 100% reduce 100%
14/08/29 11:23:46 INFO streaming.StreamJob: To kill this job, run:
14/08/29 11:23:46 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030
14/08/29 11:23:46 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030
14/08/29 11:23:46 ERROR streaming.StreamJob: Job not successful. Error: NA
14/08/29 11:23:46 INFO streaming.StreamJob: killJob...

In my JobTracker console I see this error:

java.io.IOException: log:null
R/W/S=2359/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=mapred
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Aug 29 11:22:43 CDT 2014
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.Pipe

The Python code itself is simple:

#!/usr/bin/env python
import sys

index = int(sys.argv[1])
max = 0
for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[index].isdigit():
        val = int(fields[index])
        if (val > max):
            max = val
    else:
        print max
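The "Broken pipe" in the stack trace above means the mapper process died before consuming its input; here that happens because the script file is handed to the OS without an interpreter. A quick way to catch this class of bug is to exercise the mapper logic locally before submitting the job. The sketch below (Python 3; the function name and sample rows are made up for illustration) mirrors the column-max logic of the mapper:

```python
def column_max(lines, index):
    # Largest integer seen in comma-separated column `index`;
    # non-numeric or too-short rows are skipped.
    best = 0
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) > index and fields[index].isdigit():
            best = max(best, int(fields[index]))
    return best

if __name__ == "__main__":
    sample = ["a,b,7", "x,y,42", "h,e,n/a", "p,q,13"]
    print(column_max(sample, 2))  # prints 42
```

Running the same logic over a few lines of the real input file (`head apat63_99.txt | python AttributeMax.py 8`) would have surfaced any crash in the script itself before involving the cluster.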

Accepted answer

I solved this myself. I also had to specify "python" in the mapper:

sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
    -input /sample/cite75_99.txt \
    -output /foo \
    -mapper 'python RandomSample.py 10' \
    -file RandomSample.py \
    -numReduceTasks 1
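Prefixing the command with `python` works because Streaming otherwise executes the mapper file directly, which only succeeds if the file has a shebang line and the executable bit set. That alternative can be verified locally before submitting a job (a sketch; the demo path is arbitrary, and `python3` is used here where the 2014-era cluster would have had `python`):

```shell
# Create a tiny mapper with a shebang line (demo path is arbitrary).
cat > /tmp/demo_mapper.py <<'EOF'
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    print(len(line.strip().split(",")))
EOF
chmod +x /tmp/demo_mapper.py

# Direct execution now works, which is what a bare
# -mapper 'demo_mapper.py' invocation would require.
echo "a,b,c" | /tmp/demo_mapper.py
```

If the shebang or executable bit is missing, the direct invocation fails and, on the cluster, the PipeMapper sees exactly the "Broken pipe" from the question.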

Regarding "python - Unable to run a Python-based Hadoop Streaming job", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25586655/
