
python - Bootstrapping libraries on EMR with Python MRJob

Reposted. Author: 可可西里. Updated: 2023-11-01 15:15:37

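Problem statement:

I am trying to run a map-reduce job on Amazon EMR with the Python MRJob library, but I am having trouble bootstrapping the nodes with the libraries and packages the job needs.

Details:

My sample Python mrjob code: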

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    # .. do something ..

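Some libraries, such as NLTK, need to be imported, and I am also importing some local modules, for example from sentClassifier import sentClassify.

I would like to know the best way to bootstrap the EMR nodes so that these functions and packages are available. The code runs fine on my local machine.

My sample mrjob.conf file: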

    runners:
      emr:
        aws_access_key_id: ***
        aws_secret_access_key: ***
        ec2_core_instance_type: m1.large
        ec2_key_pair: mykey
        ec2_key_pair_file: mykey.pem
        num_ec2_core_instances: 5
        pool_wait_minutes: 2
        pool_emr_job_flows: true
        ssh_tunnel_is_open: true
        ssh_tunnel_to_job_tracker: true
      hadoop:
        setup:
        - virtualenv venv
        - . venv/bin/activate
        - pip install mr3po simplejson
        - sudo easy_install https://code.google.com/p/nltk/downloads/detail?name=nltk-2.0b9-py2.6.egg&can=2&q=

But the job fails.

I have read through several references and tried all of the various approaches, still without success.

Error log:

    Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk
    (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
    Attempting to terminate job...
    Job appears to have already been terminated
    Killing our SSH tunnel (pid 12909)
    Traceback (most recent call last):
      File "obidroidMR.py", line 107, in <module>
        ObidroidReview.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
        mr_job.execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
        super(MRJob, self).execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
        self.run_job()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
        runner.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
        self._wait_for_job_to_complete()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
        raise Exception(msg)
    Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk

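Any help is greatly appreciated.

Best answer

In your mrjob.conf, the lines that install the packages are probably not in the right place. Anything that should apply to jobs running on EMR belongs under emr:, not under hadoop: (which is the configuration used when running jobs on a local Hadoop installation).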
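To make the placement issue concrete, here is a minimal sketch that scans an already-parsed configuration for install commands sitting under a runner other than the one in use. It assumes the config has been loaded into a plain dict (as a YAML parser would return it); the helper name `misplaced_install_sections` and the `INSTALL_KEYS` tuple are hypothetical and not part of mrjob.

```python
# Sketch: find install commands listed under a runner other than the target.
# The dict mirrors the shape of the mrjob.conf shown in the question.

# Option names taken from the configs in this post (assumption: only these matter).
INSTALL_KEYS = ("setup", "bootstrap_cmds")


def misplaced_install_sections(conf, target_runner="emr"):
    """Return (runner, key) pairs whose commands sit outside target_runner."""
    misplaced = []
    for runner, options in conf.get("runners", {}).items():
        if runner == target_runner or not options:
            continue
        for key in INSTALL_KEYS:
            if options.get(key):
                misplaced.append((runner, key))
    return misplaced


conf = {
    "runners": {
        "emr": {"ec2_core_instance_type": "m1.large", "num_ec2_core_instances": 5},
        "hadoop": {
            "setup": [
                "virtualenv venv",
                ". venv/bin/activate",
                "pip install mr3po simplejson",
            ]
        },
    }
}

for runner, key in misplaced_install_sections(conf):
    print("'{0}' is listed under '{1}:', not under 'emr:'".format(key, runner))
```

For the question's config, this reports the `setup` list under `hadoop:`, which is the section the answer suggests moving.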

If the package can be installed with a simple Linux command such as pip or apt-get, you should be able to install it like this:

    runners:
      emr:
        aws_access_key_id: ***
        # ... all the other stuff ...
        bootstrap_cmds:
        - sudo apt-get install -y python-boto
        - sudo pip install simplejson
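Because the nesting decides which runner the commands apply to, it can help to generate the fragment instead of typing it by hand. The following is only a sketch: `render_bootstrap_cmds` is a hypothetical helper, not an mrjob function, and it assumes that JSON-style double-quoted strings are acceptable YAML scalars for the commands.

```python
import json


def render_bootstrap_cmds(commands, runner="emr", indent=2):
    """Render shell commands as a bootstrap_cmds list nested under a runner."""
    pad = " " * indent
    lines = [
        "runners:",
        "{0}{1}:".format(pad, runner),
        "{0}bootstrap_cmds:".format(pad * 2),
    ]
    # Quote each command so characters such as ':' or '#' stay part of the string.
    lines += ["{0}- {1}".format(pad * 2, json.dumps(cmd)) for cmd in commands]
    return "\n".join(lines) + "\n"


fragment = render_bootstrap_cmds(
    ["sudo apt-get install -y python-boto", "sudo pip install simplejson"]
)
print(fragment)
```

The printed text is only the install-related part of the file; credentials and instance settings still have to be added under the same `emr:` key.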

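I have never tried installing NLTK specifically, so I can't help with the details there, but you should be able to install it along the same lines.

For installations that may be more complex, I suggest using the EMR CLI to ssh into your master node: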

    $ ./elastic-mapreduce -j JOB_FLOW_ID --ssh

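Then actually try installing the package there. Once you find a sequence of shell commands that installs it successfully, you can simply copy and paste them into your mrjob.conf.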
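While experimenting on the node, one way to tell whether an install command worked is to try the imports the job needs before launching a full run. The sketch below is an illustration rather than part of mrjob: `missing_modules` is a hypothetical helper, and the module list is an assumption based on the imports in the question (local modules such as sentClassifier are shipped with the job, not installed, so they are left out).

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list: third-party packages imported by the job in the question.
    required = ["mrjob", "nltk"]
    absent = missing_modules(required)
    if absent:
        print("Still missing: {0}".format(", ".join(absent)))
    else:
        print("All required modules import cleanly")
```

Run it with the same Python interpreter the job's tasks use on the node; an empty "missing" list suggests the install commands are ready to be copied into mrjob.conf.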

Regarding python - Bootstrapping libraries on EMR with Python MRJob, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23440564/
