
python - "Counters from Step 1: No Counters found" using Hadoop and mrjob

Reposted. Author: 可可西里. Updated: 2023-11-01 14:58:00

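I have a Python file that uses mrjob on Hadoop (version 2.6.0) to count bigrams, but I'm not getting the output I expect, and I'm having trouble deciphering the terminal output to figure out where I went wrong.

My code: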

import re

import mrjob.protocol
from mrjob.job import MRJob

# Matches runs of word characters and apostrophes
regex_for_words = re.compile(r"\b[\w']+\b")

class BiCo(MRJob):
    # RawProtocol writes the key and value as plain strings
    OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

    def mapper(self, _, line):
        words = regex_for_words.findall(line)
        wordsinline = list()
        for word in words:
            wordsinline.append(word.lower())
        wordscounter = 0
        totalwords = len(wordsinline)
        for word in wordsinline:
            # Pair every word except the last with the word that follows it
            if wordscounter < (totalwords - 1):
                nextword_pos = wordscounter+1
                nextword = wordsinline[nextword_pos]
                bigram = word, nextword  # note: this builds a tuple
                wordscounter +=1
                yield (bigram, 1)

    def combiner(self, bigram, counts):
        yield (bigram, sum(counts))

    def reducer(self, bigram, counts):
        yield (bigram, str(sum(counts)))

if __name__ == '__main__':
    BiCo.run()

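I ran the mapper code (essentially everything up through the "yield" line) on my local machine to confirm that it extracts bigrams as expected, so I think it should work... but, of course, something is going wrong.

When I run the code on the Hadoop server, I get the following output (apologies if this is more than necessary; the screen prints a lot of information and I'm not yet sure which parts help narrow down the problem):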

HADOOP: 2015-10-25 17:00:46,992 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Running job: job_1438612881113_6410
HADOOP: 2015-10-25 17:00:52,110 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1376)) - Job job_1438612881113_6410 running in uber mode : false
HADOOP: 2015-10-25 17:00:52,111 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 0% reduce 0%
HADOOP: 2015-10-25 17:00:58,171 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 33% reduce 0%
HADOOP: 2015-10-25 17:01:00,184 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 0%
HADOOP: 2015-10-25 17:01:07,222 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 100%
HADOOP: 2015-10-25 17:01:08,239 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1394)) - Job job_1438612881113_6410 completed successfully
HADOOP: 2015-10-25 17:01:08,321 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1401)) - Counters: 51
HADOOP: File System Counters
HADOOP: FILE: Number of bytes read=2007840
HADOOP: FILE: Number of bytes written=4485245
HADOOP: FILE: Number of read operations=0
HADOOP: FILE: Number of large read operations=0
HADOOP: FILE: Number of write operations=0
HADOOP: HDFS: Number of bytes read=1013129
HADOOP: HDFS: Number of bytes written=0
HADOOP: HDFS: Number of read operations=12
HADOOP: HDFS: Number of large read operations=0
HADOOP: HDFS: Number of write operations=2
HADOOP: Job Counters
HADOOP: Killed map tasks=1
HADOOP: Launched map tasks=4
HADOOP: Launched reduce tasks=1
HADOOP: Rack-local map tasks=4
HADOOP: Total time spent by all maps in occupied slots (ms)=33282
HADOOP: Total time spent by all reduces in occupied slots (ms)=12358
HADOOP: Total time spent by all map tasks (ms)=16641
HADOOP: Total time spent by all reduce tasks (ms)=6179
HADOOP: Total vcore-seconds taken by all map tasks=16641
HADOOP: Total vcore-seconds taken by all reduce tasks=6179
HADOOP: Total megabyte-seconds taken by all map tasks=51121152
HADOOP: Total megabyte-seconds taken by all reduce tasks=18981888
HADOOP: Map-Reduce Framework
HADOOP: Map input records=28214
HADOOP: Map output records=133627
HADOOP: Map output bytes=2613219
HADOOP: Map output materialized bytes=2007852
HADOOP: Input split bytes=304
HADOOP: Combine input records=133627
HADOOP: Combine output records=90382
HADOOP: Reduce input groups=79518
HADOOP: Reduce shuffle bytes=2007852
HADOOP: Reduce input records=90382
HADOOP: Reduce output records=0
HADOOP: Spilled Records=180764
HADOOP: Shuffled Maps =3
HADOOP: Failed Shuffles=0
HADOOP: Merged Map outputs=3
HADOOP: GC time elapsed (ms)=93
HADOOP: CPU time spent (ms)=7940
HADOOP: Physical memory (bytes) snapshot=1343377408
HADOOP: Virtual memory (bytes) snapshot=14458105856
HADOOP: Total committed heap usage (bytes)=4045406208
HADOOP: Shuffle Errors
HADOOP: BAD_ID=0
HADOOP: CONNECTION=0
HADOOP: IO_ERROR=0
HADOOP: WRONG_LENGTH=0
HADOOP: WRONG_MAP=0
HADOOP: WRONG_REDUCE=0
HADOOP: Unencodable output
HADOOP: TypeError=79518
HADOOP: File Input Format Counters
HADOOP: Bytes Read=1012825
HADOOP: File Output Format Counters
HADOOP: Bytes Written=0
HADOOP: 2015-10-25 17:01:08,321 INFO [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: hdfs:///user/andersaa/si601f15lab5_output
Counters from step 1:
(no counters found)

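I'm confused about why no counters were found for step 1 (which I assume is the mapper part of my code, though that may be a wrong assumption). If I'm reading the Hadoop output correctly, the job at least reached the reduce stage (since there are Reduce input groups) and no shuffle errors were found. I suspect the answer lies in "Unencodable output: TypeError=79518", but none of my Google searches have helped pin down what this error is.

Any help or insight is greatly appreciated.

Best answer

One problem is how the mapper encodes the bigram. As written above, bigram is a Python tuple: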

>>> word = 'the'
>>> word2 = 'boy'
>>> bigram = word, word2
>>> type(bigram)
<type 'tuple'>
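
To see why a tuple key is a problem here, the following is a minimal sketch of what a raw, tab-separated output step does with a key and a value. The helper names write_raw and try_write are hypothetical and this is an illustration only, not mrjob's actual source. It is consistent with the log above, where TypeError=79518 equals the number of reduce input groups and Reduce output records is 0, which suggests every reducer output failed to encode.

```python
def write_raw(key, value):
    """Join key and value with a tab, as a raw line-based protocol might.

    Hypothetical helper for illustration; not mrjob's implementation.
    """
    return "\t".join((key, value))


def try_write(key, value):
    """Return the encoded line, or a short message if encoding fails."""
    try:
        return write_raw(key, value)
    except TypeError:
        # str.join refuses non-string items such as a tuple key
        return "TypeError: unencodable output"


# A tuple key cannot be joined into a line of text
print(try_write(("the", "boy"), "1"))
# A string key can
print(try_write("the-boy", "1"))
```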

Typically, plain strings are used as keys. So create the bigram as a string instead. One way to do that is:

bigram = '-'.join((word, nextword))
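
As a rough sketch of how the whole mapper body could look with string keys, the function below pairs each word with its successor using zip instead of a manual counter. The names WORD_RE and bigram_pairs are made up for this example, and the function runs outside mrjob so it can be tried locally; inside the job, the same loop would go in the mapper method.

```python
import re

# Same pattern as in the question: word characters and apostrophes
WORD_RE = re.compile(r"\b[\w']+\b")


def bigram_pairs(line):
    """Yield ("first-second", 1) for each adjacent pair of words in line."""
    words = [w.lower() for w in WORD_RE.findall(line)]
    # zip stops at the shorter sequence, so the last word has no partner
    for first, second in zip(words, words[1:]):
        yield "-".join((first, second)), 1


for key, count in bigram_pairs("The boy runs"):
    print(key, count)
```

Because the pattern never matches a hyphen, "-" cannot appear inside a word, so the joined key can still be split back into its two words if needed.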

When I make that change in your program, I see output like this:

automatic-translation 1
automatic-vs 1
automatically-focus 1
automatically-learn 1
automatically-learning 1
automatically-translate 1
available-including 1
available-without 1

Another tip: try -q on your command line to silence all the intermediate Hadoop noise. Sometimes it just gets in the way.

HTH.

Source: the original question on Stack Overflow, python - "Counters from Step 1: No Counters found" using Hadoop and mrjob: https://stackoverflow.com/questions/33335706/
