
python - Unpickling and encoding a string using rdd.map in PySpark

Reposted. Author: 可可西里. Updated: 2023-11-01 15:49:19

I need to port code from PySpark 1.3 to 2.3 (still on Python 2.7 only), and I have the following map transformation on an RDD:

import cPickle as pickle
import base64

path = "my_filename"

my_rdd = "rdd with data" # pyspark.rdd.PipelinedRDD()

# saving RDD to a file but first encoding everything
my_rdd.map(lambda line: base64.b64encode(pickle.dumps(line))).saveAsTextFile(path)

# another my_rdd.map doing the opposite of the above, fails with the same error
my_rdd = sc.textFile(path).map(lambda line: pickle.loads(base64.b64decode(line)))

When I run this part, I get the following error:

   raise pickle.PicklingError(msg)
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

It looks like this kind of operation is no longer allowed inside a map function. Any suggestions on how to rewrite this part?

Update:

Strangely, even just doing:

my_rdd.saveAsTextFile(path)

fails with the same error.

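Best answer

In the end, the problem turned out to be deep inside the functions that performed the transformations. In this case, rewriting them was easier than debugging.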
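The answer does not show the rewritten functions, so the following is only a minimal sketch of one possible rewrite, not the author's actual code. According to the error message, the failure occurs when a function used in a transformation references an RDD, so the sketch keeps the encoding and decoding logic in small module-level functions that depend only on their argument. The names `encode_record` and `decode_record` are hypothetical, and the sketch assumes Python 3's `pickle`; on Python 2.7 the question's `import cPickle as pickle` would take its place.

```python
import base64
import pickle  # assumption: Python 3; the question uses `import cPickle as pickle` on 2.7


def encode_record(record):
    """Pickle one record and return it as a single-line base64 text string.

    The function touches only its argument, so nothing from the driver
    (such as an RDD or a SparkContext) is captured when it is serialized.
    """
    return base64.b64encode(pickle.dumps(record)).decode("ascii")


def decode_record(line):
    """Inverse of encode_record: base64-decode a text line and unpickle it."""
    return pickle.loads(base64.b64decode(line))


if __name__ == "__main__":
    # Local round-trip check that does not need a Spark cluster.
    sample = {"id": 1, "tags": ["a", "b"]}
    assert decode_record(encode_record(sample)) == sample
```

With helpers like these, the calls from the question would become `my_rdd.map(encode_record).saveAsTextFile(path)` and `sc.textFile(path).map(decode_record)`. Because transformations are lazy, a function earlier in the lineage of `my_rdd` that still references an RDD would raise the same error at the save step, which matches the behaviour described in the update.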

Regarding "python - Unpickling and encoding a string using rdd.map in PySpark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52333043/
