
python - Pyspark --py-files doesn't work


I am using it as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html

Spark version 1.1.0

./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \
/home/hadoop/loganalysis/ship-test.py

And the conf in the code:

conf = (SparkConf()
.setMaster("yarn-client")
.setAppName("LogAnalysis")
.set("spark.executor.memory", "1g")
.set("spark.executor.cores", "4")
.set("spark.executor.num", "2")
.set("spark.driver.memory", "4g")
.set("spark.kryoserializer.buffer.mb", "128"))

And the worker nodes report an ImportError:

14/12/25 05:09:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
return self.loads(obj)
ImportError: No module named parser

And parser-src.zip passes a local test:

[hadoop@ip-172-31-10-231 ~]$ python
Python 2.7.8 (default, Nov 3 2014, 10:17:30)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.insert(1, '/home/hadoop/loganalysis/parser-src.zip')
>>> from parser import parser
>>> parser.parse
<function parse at 0x7fa5ef4c9848>
>>>

I am trying to get information from the remote workers: whether the file was actually copied over, and what sys.path looks like there. That is tricky.

UPDATE: with the code below I found that the zip file does get shipped and sys.path is set, yet the import still fails.

import os
import sys

data = list(range(4))
disdata = sc.parallelize(data)
result = disdata.map(lambda x: "sys.path: {0}\nDIR: {1} \n FILES: {2} \n parser: {3}".format(sys.path, os.getcwd(), os.listdir('.'), str(parser)))
result.collect()
print(result.take(4))

It looks like I have to dig into cloudpickle, which means I first need to understand how cloudpickle works and where it fails.

: An error occurred while calling o40.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 23, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
return self.loads(obj)
File "/home/hadoop/spark/python/pyspark/cloudpickle.py", line 811, in subimport
__import__(name)
ImportError: ('No module named parser', <function subimport at 0x7f219ffad7d0>, ('parser.parser',))

UPDATE:

Someone ran into the same problem on Spark 0.8: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Importing-other-py-files-in-PYTHONPATH-td2301.html

But he put his library into Python's dist-packages and the import worked. I tried that, but I still get the import error.

UPDATE:

Oh gosh... I think the problem comes from not understanding how zip files and Python's import behavior interact. I passed parser.py to --py-files and it works (it then complains about another dependency), and zipping only the .py files [excluding .pyc] also seems to work.

But I don't quite understand why.
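
For what it's worth, here is a minimal sketch of building such an archive with Python's zipfile module, assuming the package lives under /home/hadoop/loganalysis/parser (that layout is an assumption): the package directory has to sit at the root of the zip so that "from parser import parser" resolves once the zip is on sys.path on the workers.

import os
import zipfile

# Hypothetical paths; adjust to the real project layout.
src_root = "/home/hadoop/loganalysis"
archive = "/home/hadoop/loganalysis/parser-src.zip"

with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _, filenames in os.walk(os.path.join(src_root, "parser")):
        for name in filenames:
            if name.endswith(".py"):  # ship only .py files, skip .pyc
                full = os.path.join(dirpath, name)
                # arcname relative to src_root keeps parser/ at the archive root
                zf.write(full, arcname=os.path.relpath(full, src_root))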

Best answer

Try this method of SparkContext:
sc.addPyFile(path)

According to the PySpark documentation here:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
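
A minimal usage sketch (the input path and the parse function are placeholders, not taken from the question):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn-client").setAppName("LogAnalysis")
sc = SparkContext(conf=conf)

# Ship the dependency to every executor; this plays the same role as --py-files.
sc.addPyFile("/home/hadoop/loganalysis/parser-src.zip")

def parse_line(line):
    # Import inside the task so each worker resolves it from the shipped zip.
    from parser import parser
    return parser.parse(line)

# "hdfs:///logs/input" is a placeholder input path.
result = sc.textFile("hdfs:///logs/input").map(parse_line)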

Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method.
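
For example (the bucket and key are placeholders, and this only works if the cluster has Hadoop's S3 filesystem support configured):

# Hypothetical S3 location; requires the cluster to be able to read s3n:// URIs.
sc.addPyFile("s3n://my-bucket/libs/parser-src.zip")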

More comprehensive reading material is available here: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html

Regarding python - Pyspark --py-files doesn't work, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27644525/
