
python - mrjob cannot find the input file


I am using the Cloudera VM. Here is my file structure:

[cloudera@quickstart pydoop]$ hdfs dfs -ls -R /input
drwxr-xr-x - cloudera supergroup 0 2015-10-02 15:00 /input/test1
-rw-r--r-- 1 cloudera supergroup 62 2015-10-02 15:00 /input/test1/file1.txt
drwxr-xr-x - cloudera supergroup 0 2015-10-02 14:59 /input/test2
-rw-r--r-- 1 cloudera supergroup 1428841 2015-10-02 14:59 /input/test2/5000-8.txt
-rw-r--r-- 1 cloudera supergroup 674570 2015-10-02 14:59 /input/test2/pg20417.txt
-rw-r--r-- 1 cloudera supergroup 1573151 2015-10-02 14:59 /input/test2/pg4300.txt

Here is the command I run for the wordcount example:

python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs://input/test1/file1.txt

It crashes with the output below. It looks like it cannot find the file.

[cloudera@quickstart hadoop]$ python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs://input/test1/file1.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Traceback (most recent call last):
File "/home/cloudera/MapReduceCode/mrjob/wordcount1.py", line 13, in <module>
MRWordCount.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
mr_job.execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 153, in execute
self.run_job()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 216, in run_job
runner.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/runner.py", line 470, in run
self._run()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 233, in _run
self._check_input_exists()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 247, in _check_input_exists
if not self.path_exists(path):
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/composite.py", line 78, in path_exists
return self._do_action('path_exists', path_glob)
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/composite.py", line 54, in _do_action
return getattr(fs, action)(path, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/hadoop.py", line 212, in path_exists
ok_stderr=[_HADOOP_LS_NO_SUCH_FILE])
File "/usr/local/lib/python2.7/site-packages/mrjob/fs/hadoop.py", line 86, in invoke_hadoop
proc = Popen(args, stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/usr/local/lib/python2.7/subprocess.py", line 1326, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
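
For reference, wordcount1.py follows the word-frequency example from the mrjob documentation. A minimal sketch of it (the exact script may differ slightly):

    from mrjob.job import MRJob


    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # one record per input line: emit character, word, and line counts
            yield "chars", len(line)
            yield "words", len(line.split())
            yield "lines", 1

        def reducer(self, key, values):
            # sum the counts emitted for each key
            yield key, sum(values)


    if __name__ == '__main__':
        MRWordCount.run()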

Best Answer

Follow the steps below on the Cloudera Quickstart VM to get this working. The OSError raised by Popen in the traceback means mrjob could not launch the hadoop binary at all, which is why the environment needs to be set up first.

  1. Make sure HADOOP_HOME is set.

    export HADOOP_HOME=/usr/lib/hadoop

  2. Create a symlink to hadoop-streaming.jar under HADOOP_HOME. (Both settings can also be supplied through ~/.mrjob.conf; see the sketch after this list.)

    sudo ln -s /usr/lib/hadoop-mapreduce/hadoop-streaming.jar /usr/lib/hadoop

  3. Use hdfs:/// instead of hdfs://. In hdfs://input/test1/file1.txt the "input" part is parsed as a host name rather than a directory; with hdfs:/// the authority is left empty, so the cluster's default namenode is used.

    python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs:///input/test1/file1.txt
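
Steps 1 and 2 can also be expressed as mrjob configuration instead of shell setup. A minimal ~/.mrjob.conf sketch, assuming the pre-0.5 mrjob shown in the log below and the standard Quickstart VM paths (hadoop_home and hadoop_streaming_jar are hadoop-runner options documented by mrjob):

    # ~/.mrjob.conf - read automatically by mrjob if present in the home directory
    runners:
      hadoop:
        hadoop_home: /usr/lib/hadoop
        hadoop_streaming_jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar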

Below is the complete mrjob output from my Cloudera Quickstart VM.

Note: the locations of wordcount1.py and file1.txt differ from yours, but that does not matter.

[cloudera@quickstart ~]$ python wordcount1.py -r hadoop hdfs:///user/cloudera/file1.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
writing wrapper script to /tmp/wordcount1.cloudera.20151011.115958.773999/setup-wrapper.sh
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

HADOOP: packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob3860196653022444549.jar tmpDir=null
HADOOP: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
HADOOP: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1444564543695_0003
HADOOP: Submitted application application_1444564543695_0003
HADOOP: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1444564543695_0003/
HADOOP: Running job: job_1444564543695_0003
HADOOP: Job job_1444564543695_0003 running in uber mode : false
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: Job job_1444564543695_0003 completed successfully
HADOOP: Counters: 49
HADOOP: File System Counters
HADOOP: FILE: Number of bytes read=105
HADOOP: FILE: Number of bytes written=356914
HADOOP: FILE: Number of read operations=0
HADOOP: FILE: Number of large read operations=0
HADOOP: FILE: Number of write operations=0
HADOOP: HDFS: Number of bytes read=322
HADOOP: HDFS: Number of bytes written=32
HADOOP: HDFS: Number of read operations=9
HADOOP: HDFS: Number of large read operations=0
HADOOP: HDFS: Number of write operations=2
HADOOP: Job Counters
HADOOP: Launched map tasks=2
HADOOP: Launched reduce tasks=1
HADOOP: Data-local map tasks=2
HADOOP: Total time spent by all maps in occupied slots (ms)=1164160
HADOOP: Total time spent by all reduces in occupied slots (ms)=350080
HADOOP: Total time spent by all map tasks (ms)=9095
HADOOP: Total time spent by all reduce tasks (ms)=2735
HADOOP: Total vcore-seconds taken by all map tasks=9095
HADOOP: Total vcore-seconds taken by all reduce tasks=2735
HADOOP: Total megabyte-seconds taken by all map tasks=1164160
HADOOP: Total megabyte-seconds taken by all reduce tasks=350080
HADOOP: Map-Reduce Framework
HADOOP: Map input records=5
HADOOP: Map output records=15
HADOOP: Map output bytes=153
HADOOP: Map output materialized bytes=152
HADOOP: Input split bytes=214
HADOOP: Combine input records=0
HADOOP: Combine output records=0
HADOOP: Reduce input groups=3
HADOOP: Reduce shuffle bytes=152
HADOOP: Reduce input records=15
HADOOP: Reduce output records=3
HADOOP: Spilled Records=30
HADOOP: Shuffled Maps =2
HADOOP: Failed Shuffles=0
HADOOP: Merged Map outputs=2
HADOOP: GC time elapsed (ms)=148
HADOOP: CPU time spent (ms)=1470
HADOOP: Physical memory (bytes) snapshot=428871680
HADOOP: Virtual memory (bytes) snapshot=2197188608
HADOOP: Total committed heap usage (bytes)=144179200
HADOOP: Shuffle Errors
HADOOP: BAD_ID=0
HADOOP: CONNECTION=0
HADOOP: IO_ERROR=0
HADOOP: WRONG_LENGTH=0
HADOOP: WRONG_MAP=0
HADOOP: WRONG_REDUCE=0
HADOOP: File Input Format Counters
HADOOP: Bytes Read=108
HADOOP: File Output Format Counters
HADOOP: Bytes Written=32
HADOOP: Output directory: hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
Counters from step 1:
(no counters found)
Streaming final output from hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
"chars" 67
"lines" 5
"words" 16
removing tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
deleting hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999 from HDFS
[cloudera@quickstart ~]$

Regarding "python - mrjob cannot find the input file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33005403/
