
python - How do I create a Hadoop runner?

Reposted · Author: 行者123 · Updated: 2023-12-02 21:52:56

I have the following simple mrjob script, which reads a large file line by line, performs an operation on each line, and prints the output:

#!/usr/bin/env python

from mrjob.job import MRJob

class LineProcessor(MRJob):
    def mapper(self, _, line):
        yield (line.upper(), None)  # toy example: mapper just upper-cases the line

if __name__ == '__main__':
    # mr_job = LineProcessor(args=['-r', 'hadoop', '/path/to/input'])  # error!
    mr_job = LineProcessor(args=['/path/to/input'])
    with mr_job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            key, value = mr_job.parse_output_line(line)
            print key.encode('utf-8')  # don't care about value in my case
(This is just a toy example; in my real use case, processing each line is expensive, which is why I want to run it in a distributed way.)

It only runs as a local process. If I try to use '-r', 'hadoop' (see the commented-out line above), I get the following strange error:
  File "mrjob/runner.py", line 727, in _get_steps
'error getting step information: %s', stderr)
Exception: ('error getting step information: %s', 'Traceback (most recent call last):\n File "script.py", line 11, in <module>\n with mr_job.make_runner() as runner:\n File "mrjob/job.py", line 515, in make_runner\n " __main__, which doesn\'t work." % w)\nmrjob.job.UsageError: make_runner() was called with --steps. This probably means you tried to use it from __main__, which doesn\'t work.\n')

How do I actually run this on Hadoop, i.e., create a HadoopJobRunner?

Best Answer

Are you missing

def steps(self):
    return [self.mr(
        mapper_init = ...,
        mapper = self.mapper,
        combiner = ...,
        reducer = ...,
    )]

in your LineProcessor?
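
For reference, here is a minimal sketch of what that suggestion might look like filled in for this particular job. It keeps the self.mr(...) style quoted in the answer; since LineProcessor only has a mapper, the combiner/reducer placeholders are dropped (an illustrative assumption, not part of the original answer):

from mrjob.job import MRJob

class LineProcessor(MRJob):
    def mapper(self, _, line):
        # toy example: emit the upper-cased line as the key
        yield (line.upper(), None)

    def steps(self):
        # explicitly declare the job's single, map-only step
        return [self.mr(mapper=self.mapper)]

Note also that the traceback in the question says make_runner() was called from __main__ while mrjob re-ran the script with --steps; if defining steps() alone does not help, keeping the driver code (make_runner()/run()) in a separate script from the module that defines LineProcessor is another thing to try.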

Regarding "python - How do I create a Hadoop runner?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18455538/
