Python Hadoop streaming with an imported package that is not installed on the data nodes

Reposted. Author: 可可西里. Updated: 2023-11-01 14:58:56

I am trying to import scikit-image in a Python Hadoop streaming job. I have already tried the existing Stack Overflow posts here and here, but neither of them solved my problem.

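The real question is: even if I use -file to ship a zip/mod file containing the packaged scikit-image folder, how does the Python script running on the data nodes know how to extract these packages and import them into the code? Note that I have already installed Python scikit-image on the name node and can run local experiments there.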

My script is simple: the classic word-count example for Python streaming, with one extra "import skimage" in mapper.py. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python


My command:

hadoop jar hadoop-streaming.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-file ./skimage.mod \
-input /user/text/* \
-output /user/textoutput/

Console output:

packageJobJar: [mapper.py, reducer.py, ./skimage.zip] [/usr/lib/gphd/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0/hadoop-streaming-2.0.2-alpha-gphd-2.0.1.0.jar] /tmp/streamjob6159562120374599467.jar tmpDir=null
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/04 18:00:03 INFO mapred.FileInputFormat: Total input paths to process : 1
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: number of splits:2
14/04/04 18:00:03 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/04 18:00:03 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1384839777050_0106
14/04/04 18:00:04 INFO client.YarnClientImpl: Submitted application application_1384839777050_0106 to ResourceManager at hdm3.gphd.local/172.28.9.252:8032
14/04/04 18:00:04 INFO mapreduce.Job: The url to track the job: http://hdm3.gphd.local:8088/proxy/application_1384839777050_0106/
14/04/04 18:00:04 INFO mapreduce.Job: Running job: job_1384839777050_0106
14/04/04 18:00:08 INFO mapreduce.Job: Job job_1384839777050_0106 running in uber mode : false
14/04/04 18:00:08 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 18:00:12 INFO mapreduce.Job: Task Id : attempt_1384839777050_0106_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)

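I checked the error log of the Hadoop job, and it reports that "import skimage" cannot be resolved, which means the package is not being picked up on the data nodes.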

Best answer

Have you tried the zipimport solution?

Here is an example: Hadoop: How to include third party library in Python MapReduce
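
As a rough sketch of what the zipimport approach could look like inside mapper.py (this is not taken from the linked answer): putting a zip archive on sys.path lets Python's built-in zipimport machinery import modules straight from the archive, with no extraction step. The helper name import_from_zip is hypothetical, the code targets Python 3, and it assumes that a file shipped with -file ends up in the task's working directory on the data node.

```python
import importlib
import os
import sys


def import_from_zip(zip_path, module_name):
    """Import module_name from the zip archive at zip_path.

    Hypothetical helper: it puts the archive on sys.path so that Python's
    built-in zipimport hook can load pure-Python modules from inside it.
    """
    zip_path = os.path.abspath(zip_path)
    if not os.path.isfile(zip_path):
        # Fail early with a clear message instead of an opaque ImportError.
        raise FileNotFoundError("archive not found: " + zip_path)
    if zip_path not in sys.path:
        # Search the shipped archive before any site-wide packages.
        sys.path.insert(0, zip_path)
    # Drop cached path-finder state so the new entry is picked up.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

In the mapper, a call such as `skimage = import_from_zip("skimage.zip", "skimage")` would replace the plain `import skimage`. One caveat: zipimport loads only pure-Python modules and cannot load compiled extension modules (.so / .pyd). scikit-image contains compiled extensions, so for that package the archive would probably have to be unpacked into a directory on the data node, with that directory added to sys.path instead of the zip file.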

This topic is covered by a similar question on Stack Overflow: https://stackoverflow.com/questions/22896763/
