
xml - Processing XML with Hadoop Streaming


I ran the following command:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -inputreader "StreamXmlRecordReader, begin=<metaData>,end=</metaData>" -input /user/root/xmlpytext/metaData.xml -mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -reducer /Users/amrita/desktop/hadoop/pythonpractise/reducerxml.py  -file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py -output /user/root/xmlpytext-output1 -numReduceTasks 1

but it shows:
13/03/22 09:38:48 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/22 09:38:49 INFO streaming.StreamJob: getLocalDirs(): [/Users/amrita/desktop/hadoop/temp/mapred/local]
13/03/22 09:38:49 INFO streaming.StreamJob: Running job: job_201303220919_0001
13/03/22 09:38:49 INFO streaming.StreamJob: To kill this job, run:
13/03/22 09:38:49 INFO streaming.StreamJob: /private/var/root/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=-kill job_201303220919_0001
13/03/22 09:38:49 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303220919_0001
13/03/22 09:38:50 INFO streaming.StreamJob: map 0% reduce 0%
13/03/22 09:39:26 INFO streaming.StreamJob: map 100% reduce 100%
13/03/22 09:39:26 INFO streaming.StreamJob: To kill this job, run:
13/03/22 09:39:26 INFO streaming.StreamJob: /private/var/root/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=-kill job_201303220919_0001
13/03/22 09:39:26 INFO streaming.StreamJob: Tracking URL: http:///jobdetails.jsp?jobid=job_201303220919_0001
13/03/22 09:39:26 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201303220919_0001_m_000000
13/03/22 09:39:26 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

When I look at jobdetails.jsp, it shows:
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:77)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:74)
... 8 more
Caused by: java.io.IOException: JobConf: missing required property: stream.recordreader.begin
at org.apache.hadoop.streaming.StreamXmlRecordReader.checkJobGet(StreamXmlRecordReader.java:278)
at org.apache.hadoop.streaming.StreamXmlRecordReader.<init>(StreamXmlRecordReader.java:52)
... 13 more

My mapper:
#!/usr/bin/env python
import sys
import cStringIO
import xml.etree.ElementTree as xml

def cleanResult(element):
    # Return the element's stripped text, or "" if the element is missing
    result = None
    if element is not None:
        result = element.text
        result = result.strip()
    else:
        result = ""
    return result

def process(val):
    # Parse one buffered <metaData> record and emit "sceneID,cloudCover"
    root = xml.fromstring(val)
    sceneID = cleanResult(root.find('sceneID'))
    cc = cleanResult(root.find('cloudCover'))
    returnval = ("%s,%s") % (sceneID, cc)
    return returnval.strip()

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<metaData>") != -1:
            # Start of a record: open a buffer
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</metaData>") != -1:
            # End of a record: parse the buffer and print the result
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print process(val)
        else:
            if intext:
                buff.write(line)

And the reducer:
#!/usr/bin/env python
import sys

if __name__ == '__main__':
    # Identity reducer: pass each mapper output line through unchanged
    for line in sys.stdin:
        print line.strip()
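
To check the scripts locally before submitting the job, you can feed a record to the mapper over stdin, the same way Hadoop Streaming delivers input. The sketch below is not from the original post: it assumes mapperxml.py is in the current directory, and the sceneID/cloudCover values are fabricated for illustration.

#!/usr/bin/env python
# Minimal local sanity check (assumption: mapperxml.py sits in the
# current directory). Pipes one fabricated <metaData> record through
# the mapper, exactly as Hadoop Streaming would on stdin.
import subprocess

# Fabricated sample values, for illustration only
sample = "\n".join([
    "<metaData>",
    "  <sceneID>LE70010002003122EDC00</sceneID>",
    "  <cloudCover>23.0</cloudCover>",
    "</metaData>",
    "",
])

mapper = subprocess.Popen(["python", "mapperxml.py"],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE)
out, _ = mapper.communicate(sample)
print out,   # expected: LE70010002003122EDC00,23.0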

Can anyone tell me why this is happening?
I am using hadoop-1.0.4 on a Mac.
Is something wrong? Should I change anything?
Please help.

Best Answer

Try setting the missing configuration variables as follows (add the stream.recordreader. prefix, make sure they are the first arguments after the jar, and enclose them in double quotes):

bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
"-Dstream.recordreader.begin=<metaData>" \
"-Dstream.recordreader.end=</metaData>" \
-inputreader "StreamXmlRecordReader" \
-input /user/root/xmlpytext/metaData.xml \
-mapper /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py \
-file /Users/amrita/desktop/hadoop/pythonpractise/mapperxml.py \
-reducer /Users/amrita/desktop/hadoop/pythonpractise/reducerxml.py \
-file /Users/amrita/desktop/hadoop/pythonpractise/reducerxml.py \
-output /user/root/xmlpytext-output1 \
-numReduceTasks 1
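
The -D generic options must precede the streaming-specific options or they are not picked up, which is why they sit immediately after the jar. As a side note, the single-argument -inputreader form from the question likely failed only because of the space after the comma: the reader spec is split on commas into name=value pairs stored under the stream.recordreader. prefix, so " begin" never matches the required "begin". Removing the space should also work:

-inputreader "StreamXmlRecordReader,begin=<metaData>,end=</metaData>"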

Regarding xml - Processing XML with Hadoop Streaming, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15562650/
