
python - Pickle error when creating a new pyspark dataframe by processing the old dataframe using the foreach method


Given a pyspark dataframe given_df, I need to use it to generate a new dataframe new_df.
I am trying to process the pyspark dataframe row by row using the foreach() method. Let's say, for simplicity, that both dataframes given_df and new_df consist of a single column.
I have to process each row of this dataframe and, depending on the value present in that cell, create some new rows and append them to new_df by union-ing it with the new Rows. The number of rows generated while processing a single row of given_df is variable.

new_df = spark.createDataFrame([], schema=['SampleField'])  # create an empty dataframe initially

def func(row):
    rows_to_append = getNewRowsAfterProcessingCurrentRow(row)
    # Without the following line, the union below raises an error, because Python
    # would treat new_df as a local variable accessed before being defined.
    global new_df
    new_df = new_df.union(spark.createDataFrame(data=rows_to_append, schema=['SampleField']))

given_df.foreach(func)  # given_df already contains some data; now run the function for each of its rows
However, this results in a pickling error.
If the union line is commented out, the error does not occur.
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1097, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 356, in dump
return Pickler.dump(self, obj)
File "/databricks/python/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 495, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/databricks/python/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 524, in save
rv = reduce(self.proto)
File "/databricks/spark/python/pyspark/context.py", line 356, in __getnewargs__
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
To better convey what I am trying to do, here is an example of a possible use case:
Let's say given_df is a dataframe of sentences, where each sentence consists of some words separated by spaces.

given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

new_df is a dataframe consisting of each of those words on a separate row. So we process each row of given_df and, for the words we get by splitting the row, we insert each of them as a row into new_df:
new_df = spark.createDataFrame([("The",), ("old",), ("brown",), ("fox",), ("jumps",), ("over",), ("the",), ("lazy",), ("log",)], schema=["SampleField"])

Best Answer

You are trying to use the DataFrame API on the executors, which is not allowed, hence the PicklingError: when func is shipped to the workers, everything it references has to be pickled, including new_df and, through it, the SparkContext, which can never be serialized:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.



You should rewrite your code. For example, you could use RDD.flatMap or, if you prefer the DataFrame API, the explode() function (a sketch of the flatMap route follows after the output below).
Here is how to do it using the latter approach:
given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def getNewRowsAfterProcessingCurrentRow(sentence):
    return sentence.split()

new_df = given_df \
    .select(explode(getNewRowsAfterProcessingCurrentRow("SampleField")).alias("SampleField")) \
    .unionAll(given_df)

new_df.show()
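
As an aside (an addition, not part of the original answer): for a simple whitespace split like this one, the Python UDF can be avoided entirely with the built-in pyspark.sql.functions.split, which typically performs better than a UDF. A minimal sketch, assuming the same one-column schema:

from pyspark.sql.functions import col, explode, split

# The built-in split() takes a regex pattern, so r"\s+" covers runs of whitespace.
new_df = given_df \
    .select(explode(split(col("SampleField"), r"\s+")).alias("SampleField")) \
    .unionAll(given_df)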

A few notes on the UDF version:
  • You wrap your getNewRowsAfterProcessingCurrentRow() in a udf(). This simply makes your function usable from the DataFrame API.
  • You then wrap that function in another function called explode(). This is needed because you want to "explode" (or transpose) the split-up sentences into multiple rows, one word per row.
  • Finally, you union the resulting DataFrame with your original given_df.

  • Output:
    +-----------------+
    | SampleField|
    +-----------------+
    | The|
    | old|
    | brown|
    | fox|
    | jumps|
    | over|
    | the|
    | lazy|
    | log|
    |The old brown fox|
    | jumps over|
    | the lazy log|
    +-----------------+
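
For completeness, here is a minimal sketch of the RDD.flatMap route mentioned at the start of this answer (an illustration under the same one-column schema; the body of the processing function is assumed from the question's example):

def getNewRowsAfterProcessingCurrentRow(row):
    # Plain Python, executed on the executors; returns one tuple per output row.
    return [(word,) for word in row.SampleField.split()]

new_df = given_df.rdd \
    .flatMap(getNewRowsAfterProcessingCurrentRow) \
    .toDF(["SampleField"]) \
    .unionAll(given_df)

new_df.show()

Because flatMap may emit any number of elements per input row, it naturally covers the question's requirement that a single row of given_df can produce a variable number of new rows.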

Regarding "python - Pickle error when creating a new pyspark dataframe by processing the old dataframe using the foreach method", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66694369/
