
python - Broadcasting a user-defined class in Spark

Reprinted · Author: 太空狗 · Updated: 2023-10-30 01:32:20

I am trying to broadcast a user-defined variable in a PySpark application, but I always get the following error:

  File "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/.../sparkbroad.py", line 29, in <lambda>
    output = input_.map(lambda item: b.value.map(item))
  File "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
  File "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/broadcast.py", line 97, in load
    return pickle.load(f)

AttributeError: 'module' object has no attribute 'FooMap'

The code in the module sparkbroad.py is as follows:

import random

import pyspark as spark


class FooMap(object):

    def __init__(self):
        keys = list(range(10))
        values = [2 * key for key in keys]
        self._map = dict(zip(keys, values))

    def map(self, value):
        if value not in self._map:
            return -1
        return self._map[value]


class FooMapJob(object):

    def __init__(self, inputs):
        self._inputs = inputs
        self._foomap = FooMap()

    def run(self):
        sc = spark.SparkContext('local', 'FooMap')
        input_ = sc.parallelize(self._inputs, 4)
        b = sc.broadcast(self._foomap)
        output = input_.map(lambda item: b.value.map(item))
        b.unpersist()
        result = list(output.toLocalIterator())
        sc.stop()
        return result


def main():
    inputs = [random.randint(0, 10) for _ in range(10)]
    job = FooMapJob(inputs)
    print(job.run())


if __name__ == '__main__':
    main()

I run it with:

:~$ spark-submit --master local[4] --py-files sparkbroad.py sparkbroad.py

I added the --py-files argument, but it does not seem to make any difference. Unfortunately, I could not find any example online dealing with broadcasting complex classes (only lists or dictionaries). Any hint is appreciated. Thanks in advance.

Update: placing the FooMap class in a separate module, everything seems to work fine, even without the --py-files directive.
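A minimal, Spark-free sketch of why the separate module helps: pickle can round-trip the instance once FooMap lives in an importable module. The foomap.py written at runtime below merely stands in for a real source file sitting next to the driver script; the file and directory names are illustrative.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Write a stand-in foomap.py; in the real project this would be a
# normal source file next to the driver script.
source = """
class FooMap(object):
    def __init__(self):
        keys = list(range(10))
        self._map = dict(zip(keys, [2 * key for key in keys]))

    def map(self, value):
        if value not in self._map:
            return -1
        return self._map[value]
"""
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "foomap.py"), "w") as fh:
    fh.write(source)
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
foomap = importlib.import_module("foomap")

# pickle records the class as foomap.FooMap; any process that can
# import foomap can resolve that name again, which is exactly what a
# Spark worker does when it loads the broadcast value.
clone = pickle.loads(pickle.dumps(foomap.FooMap()))
print(clone.map(4))   # 8
print(clone.map(99))  # -1
```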

Best Answer

Place the FooMap class in a separate module and everything works fine.
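The reason is that a broadcast value is pickled on the driver and unpickled inside each worker process. pickle stores an instance's class as a (module, name) pair and re-imports it on load; when FooMap is defined in the driver script itself, the workers cannot find that attribute on their copy of the module, hence the AttributeError. A minimal sketch reproducing this without Spark (the throwaway driver_script module is illustrative):

```python
import pickle
import sys
import types

# Throwaway module standing in for the driver script.
driver = types.ModuleType("driver_script")
sys.modules["driver_script"] = driver

class FooMap(object):
    def map(self, value):
        return 2 * value

# Pretend FooMap was defined at top level of driver_script.
FooMap.__module__ = "driver_script"
FooMap.__qualname__ = "FooMap"
driver.FooMap = FooMap

payload = pickle.dumps(FooMap())
print(pickle.loads(payload).map(3))  # 6 -- lookup succeeds

# A worker is a separate process whose copy of the module does not
# define FooMap; emulate that by removing the attribute before loading.
del driver.FooMap
try:
    pickle.loads(payload)
    failed = False
except AttributeError:  # "Can't get attribute 'FooMap' ..."
    failed = True
print(failed)  # True
```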

Regarding python - Broadcasting a user-defined class in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43042241/
