
python - Pandas scalar UDF fails with IllegalArgumentException

Reposted · Author: 行者123 · Updated: 2023-12-01 00:27:23

First, I apologize if my question is a simple one. I did spend a lot of time researching it.

I am trying to set up a scalar Pandas UDF in a PySpark script, as described here.

Here is my code:

from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
sc.install_pypi_package("pandas")
import pandas as pd
sc.install_pypi_package("PyArrow")

df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)

df.show()

@F.pandas_udf("double", F.PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return pd.Series(v + 1)

df.select(pandas_plus_one(df.value1)).show()
# Also fails
#df.select(pandas_plus_one(df["value1"])).show()
#df.select(pandas_plus_one("value1")).show()
#df.select(pandas_plus_one(F.col("value1"))).show()

The script fails at the last statement:

An error occurred while calling o209.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: Lost task 2.3 in stage 8.0 (TID 30, ip-10-160-2-53.ec2.internal, executor 3): java.lang.IllegalArgumentException
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
        at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
        at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
        at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
        at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
        at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
        ...

What am I missing here? I am simply following the manual. Thanks for your help.
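For completeness: the UDF's pandas-level logic runs fine outside Spark, so the problem is not in the function body itself. A minimal standalone check of the same transformation, assuming only pandas is installed (the plain helper name is hypothetical):

```python
import pandas as pd

def plus_one(v):
    # Same transformation the pandas UDF applies to each Arrow batch.
    return pd.Series(v + 1)

print(plus_one(pd.Series([1, -1, 3, 10])).tolist())  # [2, 0, 4, 11]
```

Since this succeeds, the IllegalArgumentException must come from the Arrow serialization layer between the JVM and the Python worker, not from the UDF.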

Best Answer

PyArrow released new version 0.15 on October 5, 2019, which causes pandas UDFs to throw this error. Spark needs to be upgraded to become compatible with it (which may take some time). You can follow the progress here: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29367?filter=allissues

Solutions:

  1. Install PyArrow 0.14.1 or lower: sc.install_pypi_package("pyarrow==0.14.1") (or)
  2. Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever Python runs.
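Option 2 can be wired up in Python before any pandas UDF executes. A minimal sketch (propagating the variable to executors via spark.executorEnv is an assumption about your cluster setup; adapt as needed):

```python
import os

# Force pyarrow >= 0.15 to emit the legacy (pre-0.15) Arrow IPC stream
# format that the Arrow Java library bundled with older Spark expects.
# It must be set on the driver before the first pandas UDF runs, and on
# every executor (e.g. spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1).
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Sanity check: the flag only matters for pyarrow 0.15 and newer.
try:
    import pyarrow
    major_minor = tuple(int(x) for x in pyarrow.__version__.split(".")[:2])
    print("pyarrow", pyarrow.__version__, "needs flag:", major_minor >= (0, 15))
except ImportError:
    print("pyarrow not installed locally; flag only needed on the cluster")
```

Setting the variable only on the driver is not enough: the Arrow stream is produced by the Python workers on the executors, so it has to reach them too.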

Regarding python - Pandas scalar UDF fails with IllegalArgumentException, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58458415/
