Given a PySpark dataframe given_df, I need to generate a new dataframe new_df from it. I am trying to process the dataframe row by row using the foreach() method. Let's say, for simplicity, that both given_df and new_df consist of a single column. I have to process each row of the dataframe and, depending on the value present in that cell, create some new rows and add them to new_df via union with the new Rows. The number of rows generated while processing a single row of given_df is variable.
new_df = spark.createDataFrame([], schema=['SampleField'])  # Create an empty dataframe initially

def func(row):
    rows_to_append = getNewRowsAfterProcessingCurrentRow(row)
    # Without this line, the next line raises an error, because Python would treat
    # new_df as a local variable being read before it is assigned.
    global new_df
    new_df = new_df.union(spark.createDataFrame(data=rows_to_append, schema=['SampleField']))

given_df.foreach(func)  # given_df already contains some data. Now run the function for each row.
However, this results in a pickling error.
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1097, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 356, in dump
return Pickler.dump(self, obj)
File "/databricks/python/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
[... the same recursive pickle/cloudpickle frames repeat several more times ...]
File "/databricks/spark/python/pyspark/cloudpickle.py", line 495, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/databricks/python/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 524, in save
rv = reduce(self.proto)
File "/databricks/spark/python/pyspark/context.py", line 356, in __getnewargs__
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
To better understand what I am trying to do, here is an example of a possible use case:

given_df is a dataframe of sentences, where each sentence consists of some words separated by spaces.

given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

new_df is a dataframe consisting of each word on a separate row. So we will process each row of given_df and, based on the words we get by splitting the row, insert each of them as a row into new_df.

new_df = spark.createDataFrame([("The",), ("old",), ("brown",), ("fox",), ("jumps",), ("over",), ("the",), ("lazy",), ("log",)], schema=["SampleField"])
Accepted Answer
You are trying to use the DataFrame API on the executors, which is not allowed; hence the PicklingError:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Use RDD.flatMap instead, or, if you prefer the DataFrame API, the explode() function.
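The RDD.flatMap route can be sketched as follows. This is a minimal sketch, not the answer's own code: it assumes an active SparkSession named spark (as on Databricks), and split_to_rows is a hypothetical helper name.

```python
def split_to_rows(row):
    # One input row yields zero or more output tuples;
    # flatMap flattens them all into the resulting RDD.
    return [(word,) for word in row["SampleField"].split()]

# Driver-side usage (requires an active SparkSession named `spark`):
# new_df = given_df.rdd.flatMap(split_to_rows).toDF(["SampleField"])
```

Because the splitting function runs inside the workers as plain Python and never touches SparkContext, it serializes cleanly, unlike the foreach() approach above.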
given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def getNewRowsAfterProcessingCurrentRow(text):
    # Split the sentence into its words; the UDF returns an array of strings.
    return text.split()

new_df = given_df\
    .select(explode(getNewRowsAfterProcessingCurrentRow("SampleField")).alias("SampleField"))\
    .unionAll(given_df)

new_df.show()
getNewRowsAfterProcessingCurrentRow() is wrapped in udf(). This just makes your function available in the DataFrame API. The UDF call is then wrapped in explode(). This is needed because you want to "explode" (or transpose) the split sentences into multiple rows, one word per row. The result is finally unioned with given_df. This gives:

+-----------------+
| SampleField|
+-----------------+
| The|
| old|
| brown|
| fox|
| jumps|
| over|
| the|
| lazy|
| log|
|The old brown fox|
| jumps over|
| the lazy log|
+-----------------+
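To make the expected result concrete, the same transformation can be mirrored in plain Python on the driver. This is only an illustration of what the explode-then-unionAll pipeline produces, not Spark code; explode_and_union is a hypothetical name.

```python
def explode_and_union(sentences):
    # Mirror of select(explode(udf(...))).unionAll(given_df):
    # first every word as its own row, then the original sentences appended.
    words = [word for sentence in sentences for word in sentence.split()]
    return words + sentences

result = explode_and_union(["The old brown fox", "jumps over", "the lazy log"])
# result holds the 9 words followed by the 3 original sentences.
```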
Regarding "python - Pickle error when creating a new pyspark dataframe by processing the old dataframe with the foreach method", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66694369/