
python - mrjob cannot find the input file

Reposted. Author: 可可西里. Updated: 2023-11-01 15:02:19

I am using the Cloudera virtual machine. This is my file structure:

[cloudera@quickstart pydoop]$ hdfs dfs -ls -R /input
drwxr-xr-x - cloudera supergroup 0 2015-10-02 15:00 /input/test1
-rw-r--r-- 1 cloudera supergroup 62 2015-10-02 15:00 /input/test1/file1.txt
drwxr-xr-x - cloudera supergroup 0 2015-10-02 14:59 /input/test2
-rw-r--r-- 1 cloudera supergroup 1428841 2015-10-02 14:59 /input/test2/5000-8.txt
-rw-r--r-- 1 cloudera supergroup 674570 2015-10-02 14:59 /input/test2/pg20417.txt
-rw-r--r-- 1 cloudera supergroup 1573151 2015-10-02 14:59 /input/test2/pg4300.txt

This is the command I run to execute the wordcount example:

python /home/cloudera/MapReduceCode/mrjob/ -r hadoop hdfs://input/test1/file1.txt


[cloudera@quickstart hadoop]$ python /home/cloudera/MapReduceCode/mrjob/ -r hadoop hdfs://input/test1/file1.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Traceback (most recent call last):
File "/home/cloudera/MapReduceCode/mrjob/", line 13, in <module>
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 461, in run
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 479, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 153, in execute
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 216, in run_job
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 470, in run
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 233, in _run
File "/usr/local/lib/python2.7/site-packages/mrjob/", line 247, in _check_input_exists
if not self.path_exists(path):
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/", line 78, in path_exists
return self._do_action('path_exists', path_glob)
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/", line 54, in _do_action
return getattr(fs, action)(path, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/", line 212, in path_exists
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/", line 86, in invoke_hadoop
proc = Popen(args, stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python2.7/", line 709, in __init__
errread, errwrite)
File "/usr/local/lib/python2.7/", line 1326, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
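
A note on the error itself: the traceback ends inside `Popen`, and `OSError: [Errno 2]` at that point generally means the program being launched could not be found, not that the HDFS input path is missing. The sketch below reproduces that failure mode with a deliberately nonexistent executable name and then checks for a `hadoop` binary under `HADOOP_HOME`. The names `launch_errno`, `hadoop_under_home`, and `no-such-hadoop-binary` are hypothetical, and the `bin/hadoop` layout is an assumption about a conventional Hadoop install, not a description of mrjob's own lookup logic.

```python
import errno
import os
import subprocess


def launch_errno(executable):
    """Try to start `executable`; return the errno if the launch itself fails."""
    try:
        proc = subprocess.Popen([executable], stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except OSError as exc:
        # Raised before the child runs, e.g. when the executable does not exist.
        return exc.errno
    proc.communicate()  # the program started; wait for it to finish
    return None


def hadoop_under_home(env=os.environ):
    """Return $HADOOP_HOME/bin/hadoop if it is an executable file, else None.

    Assumption: the binary lives at bin/hadoop under HADOOP_HOME.
    """
    home = env.get("HADOOP_HOME")
    if not home:
        return None
    candidate = os.path.join(home, "bin", "hadoop")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    # A missing executable yields ENOENT (errno 2), as in the traceback above.
    print(launch_errno("no-such-hadoop-binary") == errno.ENOENT)
    print(hadoop_under_home())
```

If `hadoop_under_home()` prints `None`, that is a hint that `HADOOP_HOME` is unset or points somewhere without a `hadoop` executable.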


Follow these steps to get it working on the Cloudera Quickstart VM.

  1. Make sure HADOOP_HOME is set.

    export HADOOP_HOME=/usr/lib/hadoop

  2. Create a symlink to hadoop-streaming.jar.

    sudo ln -s /usr/lib/hadoop-mapreduce/hadoop-streaming.jar /usr/lib/hadoop

  3. Use hdfs:/// instead of hdfs://

    python /home/cloudera/MapReduceCode/mrjob/ -r hadoop hdfs:///input/test1/file1.txt
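
Why the third slash in step 3 matters can be illustrated with generic URI parsing. In `hdfs://input/test1/file1.txt` the segment right after `//` is read as the authority (the namenode host), so `input` is taken as a host name and the path becomes `/test1/file1.txt`. With `hdfs:///` the authority is empty and the full path survives, so Hadoop can fall back to its configured default filesystem. The sketch below uses Python 3's `urllib.parse.urlsplit` as a stand-in for illustration only; Hadoop does its own URI parsing in Java, and the helper name `describe` is hypothetical.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into (authority, path) using generic URI rules."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


if __name__ == "__main__":
    for uri in ("hdfs://input/test1/file1.txt",
                "hdfs:///input/test1/file1.txt"):
        host, path = describe(uri)
        print("%-32s host=%r path=%r" % (uri, host, path))
```

The first form yields host `input` with path `/test1/file1.txt`; the second yields an empty host with path `/input/test1/file1.txt`, which matches the directory listing in the question.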

Below is the complete mrjob output from my Cloudera Quickstart VM.

Note that the locations of my script and of file1.txt differ from yours, but that does not matter.

[cloudera@quickstart ~]$ python -r hadoop hdfs:///user/cloudera/file1.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
writing wrapper script to /tmp/wordcount1.cloudera.20151011.115958.773999/
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at

HADOOP: packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob3860196653022444549.jar tmpDir=null
HADOOP: Connecting to ResourceManager at quickstart.cloudera/
HADOOP: Connecting to ResourceManager at quickstart.cloudera/
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1444564543695_0003
HADOOP: Submitted application application_1444564543695_0003
HADOOP: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1444564543695_0003/
HADOOP: Running job: job_1444564543695_0003
HADOOP: Job job_1444564543695_0003 running in uber mode : false
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: Job job_1444564543695_0003 completed successfully
HADOOP: Counters: 49
HADOOP: File System Counters
HADOOP: FILE: Number of bytes read=105
HADOOP: FILE: Number of bytes written=356914
HADOOP: FILE: Number of read operations=0
HADOOP: FILE: Number of large read operations=0
HADOOP: FILE: Number of write operations=0
HADOOP: HDFS: Number of bytes read=322
HADOOP: HDFS: Number of bytes written=32
HADOOP: HDFS: Number of read operations=9
HADOOP: HDFS: Number of large read operations=0
HADOOP: HDFS: Number of write operations=2
HADOOP: Job Counters
HADOOP: Launched map tasks=2
HADOOP: Launched reduce tasks=1
HADOOP: Data-local map tasks=2
HADOOP: Total time spent by all maps in occupied slots (ms)=1164160
HADOOP: Total time spent by all reduces in occupied slots (ms)=350080
HADOOP: Total time spent by all map tasks (ms)=9095
HADOOP: Total time spent by all reduce tasks (ms)=2735
HADOOP: Total vcore-seconds taken by all map tasks=9095
HADOOP: Total vcore-seconds taken by all reduce tasks=2735
HADOOP: Total megabyte-seconds taken by all map tasks=1164160
HADOOP: Total megabyte-seconds taken by all reduce tasks=350080
HADOOP: Map-Reduce Framework
HADOOP: Map input records=5
HADOOP: Map output records=15
HADOOP: Map output bytes=153
HADOOP: Map output materialized bytes=152
HADOOP: Input split bytes=214
HADOOP: Combine input records=0
HADOOP: Combine output records=0
HADOOP: Reduce input groups=3
HADOOP: Reduce shuffle bytes=152
HADOOP: Reduce input records=15
HADOOP: Reduce output records=3
HADOOP: Spilled Records=30
HADOOP: Shuffled Maps =2
HADOOP: Failed Shuffles=0
HADOOP: Merged Map outputs=2
HADOOP: GC time elapsed (ms)=148
HADOOP: CPU time spent (ms)=1470
HADOOP: Physical memory (bytes) snapshot=428871680
HADOOP: Virtual memory (bytes) snapshot=2197188608
HADOOP: Total committed heap usage (bytes)=144179200
HADOOP: Shuffle Errors
HADOOP: File Input Format Counters
HADOOP: Bytes Read=108
HADOOP: File Output Format Counters
HADOOP: Bytes Written=32
HADOOP: Output directory: hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
Counters from step 1:
(no counters found)
Streaming final output from hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
"chars" 67
"lines" 5
"words" 16
removing tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
deleting hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999 from HDFS
[cloudera@quickstart ~]$

Regarding "python - mrjob cannot find the input file", a similar question was found on Stack Overflow.
