gpt4 book ai didi

python - Dataproc Hadoop MapReduce-无法正常工作

转载 作者:行者123 更新时间:2023-12-02 18:34:33 26 4
gpt4 key购买 nike

我基本上是在尝试运行我的第一个Hadoop MapReduce例程,并且我必须使用Hadoop和MapReduce,因为我正在为一个类项目进行此操作。我想将Python用于映射器和化简器,因为我对这种语言最熟悉,并且对同龄人最熟悉。我觉得最简单的设置方法是通过Google DataProc实例,因此我也可以运行它。我将描述我的工作以及所使用的资源,但是我对此还比较陌生,可能会丢失一些东西。

Dataproc配置

Dataproc 1

Dataproc 2

Dataproc 3

然后,我可以通过SSH进入我的主节点。我将mapper.pyreducer.py文件存储在Google Cloud Storage存储桶中。

映射器和化简器代码来自this Micheal Noll blog post,经过修改可与Python 3配合使用。

mapper.py:

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()
# split the line into words
words = line.split()
# increase counters
# increase counters
for word in words:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial word count is 1
#print ('%s\t%s' % (word, 1))
print(f"{word}\t{1}")

reducer.py

#!/usr/bin/env python
"""reducer.py"""

from operator import itemgetter
import sys

print_out = lambda x, y: print(f'{x}\t{y}')

current_word = None
current_count = 0
word = None

# input comes from STDIN (standard input)
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()

# parse the input we got from mapper.py
word, count = line.split('\t', 1)

# convert count (currently a string) to int
try:
count = int(count)
except ValueError:
# count was not a number, so silently
# ignore/discard this line
continue
#print("still working")

# this IF-switch only works because Hadoop sorts map output
# by key (here: word) before it is passed to the reducer
if current_word == word:
current_count += count
else:
if current_word:
# write result to STDOUT
#print '%s\t%s' % (current_word, current_count)
print_out(current_word, current_count)
current_count = count
current_word = word

# do not forget to output the last word if needed!
if current_word == word:
#print '%s\t%s' % (current_word, current_count)
print_out(current_word, current_count)

最后,我将ssh放入主节点,然后检查python版本:
hduser@data-604-m:~$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

然后运行以下命令(改编自 here):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://data-604-hadoop/mapper.py,gs://data-604-hadoop/reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input gs://data-604-hadoop/books/pg20417.txt \
-output gs://data-604-hadoop/output

结果如下:
hduser@data-604-m:~$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar     -files gs://data-604-hadoop/mapper.py,gs://data-604-hadoop/reducer.py     -map
per mapper.py -reducer reducer.py -input gs://data-604-hadoop/books/pg20417.txt -output gs://data-604-hadoop/output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.9.2.jar] /tmp/streamjob4601880105330541890.jar tmpDir=null
19/11/12 02:10:46 INFO client.RMProxy: Connecting to ResourceManager at data-604-m/10.162.0.13:8032
19/11/12 02:10:47 INFO client.AHSProxy: Connecting to Application History server at data-604-m/10.162.0.13:10200
19/11/12 02:10:47 INFO client.RMProxy: Connecting to ResourceManager at data-604-m/10.162.0.13:8032
19/11/12 02:10:47 INFO client.AHSProxy: Connecting to Application History server at data-604-m/10.162.0.13:10200
19/11/12 02:10:49 INFO mapred.FileInputFormat: Total input files to process : 1
19/11/12 02:10:49 INFO mapreduce.JobSubmitter: number of splits:15
19/11/12 02:10:49 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher
.enabled
19/11/12 02:10:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1573523684358_0002
19/11/12 02:10:50 INFO impl.YarnClientImpl: Submitted application application_1573523684358_0002
19/11/12 02:10:50 INFO mapreduce.Job: The url to track the job: http://data-604-m:8088/proxy/application_1573523684358_0002/
19/11/12 02:10:50 INFO mapreduce.Job: Running job: job_1573523684358_0002
19/11/12 02:10:58 INFO mapreduce.Job: Job job_1573523684358_0002 running in uber mode : false
19/11/12 02:10:58 INFO mapreduce.Job: map 0% reduce 0%
19/11/12 02:11:10 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:10 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:12 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:12 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000004_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:12 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000003_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:19 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:20 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:24 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000005_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:24 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000006_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)

19/11/12 02:11:24 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000007_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:28 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000002_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:30 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000004_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:37 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:38 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:38 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000003_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:39 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000005_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:40 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000006_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:48 INFO mapreduce.Job: Task Id : attempt_1573523684358_0002_m_000007_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:458)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
19/11/12 02:11:49 INFO mapreduce.Job: map 80% reduce 0%
19/11/12 02:11:50 INFO mapreduce.Job: map 100% reduce 100%
19/11/12 02:11:50 INFO mapreduce.Job: Job job_1573523684358_0002 failed with state FAILED due to: Task failed task_1573523684358_0002_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
19/11/12 02:11:50 INFO mapreduce.Job: Counters: 14
Job Counters
Failed map tasks=19
Killed map tasks=14
Killed reduce tasks=5
Launched map tasks=22
Other local map tasks=14
Rack-local map tasks=8
Total time spent by all maps in occupied slots (ms)=885928
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=221482
Total vcore-milliseconds taken by all map tasks=221482
Total megabyte-milliseconds taken by all map tasks=453595136
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
19/11/12 02:11:50 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

而且,老实说,我现在不知道该怎么办。我花了很多时间在这个问题上,我不确定自己到底出了什么问题,所以感觉就像在砖墙上。

我也尝试过:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://data-604-hadoop/mapper.py,gs://data-604-hadoop/reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input gs://data-604-hadoop/books/pg20417.txt \
-output gs://data-604-hadoop/output

结果相似。

感谢您的帮助。

更新:
我尝试了几件事,但没有成功。
我尝试将python脚本移动到Hadoop集群上。然后,我用 head -n100 mobydick.txt | ./mapper.py | sort | ./reducer.py测试了它们,它们起作用了。在下面的评论中,我提到我调查了我的shebang并进行了更改,但这些操作均未成功。

最佳答案

这里发生了一些不同的事情,但是主要是归结为不一定能够假定每个mapper / reducer任务(作为YARN容器运行)的系统环境必然具有相同的系统环境。作为您登录的 shell 程序。在大多数情况下,许多元素将有意地有所不同(例如Java类路径等)。通常,对于基于Java的MapReduce程序,此操作可以按预期工作,因为最终会在hadoop jar命令下运行的驱动程序代码与YARN容器中的工作程序节点上运行的执行程序代码之间具有相似的环境变量和类路径。 Hadoop流媒体有点奇怪,因为在正常的Hadoop使用中它并不是一流的公民。

无论如何,在这种情况下,您遇到的主要问题是登录集群时的默认Python是Conda发行版和Python 3.7版,但是在YARN环境中生成映射程序/缩减程序任务的默认Python版本实际上是Python 2.7。这是Dataproc中某些旧兼容性考虑的不幸结果。您可以通过破解mapper.py充当所需环境信息的转储来查看实际情况,例如,在将SSH SSH到Dataproc集群时尝试运行以下命令:

echo foo > foo.txt
hdfs dfs -mkdir hdfs:///foo
hdfs dfs -put foo.txt hdfs:///foo/foo.txt
echo '#!/bin/bash' > info.sh
echo 'which python' >> info.sh
echo 'python --version 2>&1' >> info.sh
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.map.tasks=1 \
-files info.sh \
-input hdfs:///foo \
-output hdfs:///info-output \
-mapper ./info.sh

您的本地环境将显示与在hdfs:/// info-output上打印的版本不同的Python版本:
$ bash info.sh
/opt/conda/bin/python
Python 3.7.3
$ hdfs dfs -cat hdfs:///info-output/*
/usr/bin/python
Python 2.7.15+

这意味着您可以使映射器/缩减器Python 2.7兼容,也可以在shebang中显式指定 /opt/conda/bin/python。在复制设置的设置(Dataproc 1.4-ubuntu18加上jupyter.sh init操作)中,以下内容对我有用:
#!/opt/conda/bin/python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()
# split the line into words
words = line.split()
# increase counters
# increase counters
for word in words:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
#
# tab-delimited; the trivial word count is 1
#print ('%s\t%s' % (word, 1))
print(f"{word}\t{1}")

和 reducer :
#!/opt/conda/bin/python
"""reducer.py"""

from operator import itemgetter
import sys

print_out = lambda x, y: print(f'{x}\t{y}')

current_word = None
current_count = 0
word = None

# input comes from STDIN (standard input)
for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()

# parse the input we got from mapper.py
word, count = line.split('\t', 1)

# convert count (currently a string) to int
try:
count = int(count)
except ValueError:
# count was not a number, so silently
# ignore/discard this line
continue
#print("still working")

# this IF-switch only works because Hadoop sorts map output
# by key (here: word) before it is passed to the reducer
if current_word == word:
current_count += count
else:
if current_word:
# write result to STDOUT
#print '%s\t%s' % (current_word, current_count)
print_out(current_word, current_count)
current_count = count
current_word = word

# do not forget to output the last word if needed!
if current_word == word:
#print '%s\t%s' % (current_word, current_count)
print_out(current_word, current_count)

但是,要记住的另一件事是jupyter.sh初始化操作已经过时了,相反,您应该使用实际的 Dataproc Jupyter Component。无论如何,尽管您可能想要先运行上面概述的 info.sh步骤,以确定要在mapper.py和reducer.py中使用的相关Python环境。

例如,没有jupyter.sh init操作的 Vanilla Dataproc 1.4-debian9实际上将使登录的默认Python为 /opt/conda/default/bin/python而不是 /opt/conda/bin/python。并且Dataproc 1.3-debian9镜像将 /usr/bin/python作为Python 2.7的默认值。

关于python - Dataproc Hadoop MapReduce-无法正常工作,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58811116/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com