I am trying to run the simple wordcount example programmatically, but I cannot get the code to run on a Hadoop cluster.
The job, in test_jobs.py:
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)
The runner, in mr_job_tester.py:
from test_jobs import MRWordFreqCount

def test_runner(in_args, input_dir):
    tmp_output = []
    args = in_args + input_dir
    mr_job = MRWordFreqCount(args.split())
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            tmp_output = tmp_output + [line]
    return tmp_output


if __name__ == '__main__':
    input_dir = 'hdfs:///test_input/'
    args = '-r hadoop '
    print test_runner(args, input_dir)
I can run this code locally (using the inline option), but on Hadoop I get:
> Traceback (most recent call last):
>   File "mr_job_tester.py", line 17, in <module>
>     print test_runner(args, input_dir)
>   File "mr_job_tester.py", line 8, in test_runner
>     runner.run()
>   File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
>     self._run()
>   File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 239, in _run
>     self._run_job_in_hadoop()
>   File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 295, in _run_job_in_hadoop
>     for step_num in xrange(self._num_steps()):
>   File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 742, in _num_steps
>     return len(self._get_steps())
>   File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 721, in _get_steps
>     raise ValueError("Bad --steps response: \n%s" % stdout)
> ValueError: Bad --steps response:
Best answer
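mrjob submits the job file and then executes it remotely, as a script, inside the mappers and reducers. If the file does nothing when run as a script, the runner's request for the job's steps comes back empty, which is consistent with the empty "Bad --steps response" in the traceback. For that reason the following two lines must appear in the job declaration file (here, the file that defines MRWordFreqCount):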
if __name__ == "__main__":
    MRWordFreqCount.run()
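To make the failure mode concrete, here is a minimal sketch that uses only the standard library and does not import mrjob. It assumes, based on the traceback, that the runner executes the job file as a subprocess with a `--steps` argument and parses what the script prints. The two job sources, the `run` function inside them, and the helper `ask_for_steps` are hypothetical stand-ins written for this illustration (in Python 3), not mrjob's actual code.

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for a job file. It answers --steps only when it is
# executed as a script, which is what the __main__ guard makes possible.
JOB_WITH_GUARD = textwrap.dedent('''
    import json
    import sys

    def run():
        if "--steps" in sys.argv:
            print(json.dumps([{"type": "streaming"}]))

    if __name__ == "__main__":
        run()
''')

# The same file without the guard: executing it defines run() and exits
# without printing anything.
JOB_WITHOUT_GUARD = textwrap.dedent('''
    import json
    import sys

    def run():
        if "--steps" in sys.argv:
            print(json.dumps([{"type": "streaming"}]))
''')


def ask_for_steps(source):
    """Write source to a temp file, run it with --steps, parse its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        stdout = subprocess.check_output([sys.executable, path, "--steps"])
    finally:
        os.remove(path)
    text = stdout.decode("utf-8")
    try:
        return json.loads(text)
    except ValueError:
        # Empty or malformed output cannot be parsed as a step description.
        raise ValueError("Bad --steps response: \n%s" % text)


if __name__ == "__main__":
    print(ask_for_steps(JOB_WITH_GUARD))
    try:
        ask_for_steps(JOB_WITHOUT_GUARD)
    except ValueError as exc:
        print("failed: %s" % exc)
```

The script without the guard produces no output at all, so the simulated runner has nothing to parse and raises the same kind of error with an empty message body. This would also explain why the inline option works in the question, assuming that the inline runner uses the job class in the current process instead of re-executing the file.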
Source: the Stack Overflow question "python - mrjob bad --steps error when using make_runner on a Hadoop cluster": https://stackoverflow.com/questions/26449811/