
python - pyspark grouped map IllegalArgumentException error


I cannot get GROUPED_MAP to work in PySpark. I have tried example code, including some from the Spark git repo, without success. Any suggestions on what I need to change would be greatly appreciated.

For example:

from pyspark.sql import SparkSession
from pyspark.sql.utils import require_minimum_pandas_version, require_minimum_pyarrow_version

require_minimum_pandas_version()
require_minimum_pyarrow_version()


from pyspark.sql.functions import pandas_udf, PandasUDFType
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
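For reference, with compatible pandas/PyArrow versions this grouped-map example (it is essentially the one from the Spark documentation) should print something like the following, though row order may vary:

+---+----+
| id|   v|
+---+----+
|  1|-0.5|
|  1| 0.5|
|  2|-3.0|
|  2|-1.0|
|  2| 4.0|
+---+----+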

Instead, it gives me the error:


py4j.protocol.Py4JJavaError: An error occurred while calling o61.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 (TID 128, localhost, executor driver): java.lang.IllegalArgumentException

I believe PySpark is set up correctly, because the following runs successfully for me:

from pyspark.sql.functions import udf, struct, col
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
import pandas as pd


spark = SparkSession.builder.master("local[*]").getOrCreate()

def sum_diff(f1, f2):
    return [f1 + f2, f1 - f2]

schema = StructType([
StructField("sum", FloatType(), False),
StructField("diff", FloatType(), False)
])

sum_diff_udf = udf(lambda row: sum_diff(row[0], row[1]), schema)

df = spark.createDataFrame(pd.DataFrame([[1., 2.], [2., 4.]], columns=['f1', 'f2']))

df_new = df.withColumn("sum_diff", sum_diff_udf(struct([col('f1'), col('f2')])))\
    .select('*', 'sum_diff.*')
df_new.show()

Best answer

I ran into the same problem. In my case it was solved by using the recommended version of PyArrow (0.15.1) and setting an environment variable in conf/spark-env.sh, since I was using Spark 2.4.x:

ARROW_PRE_0_15_IPC_FORMAT=1

See the full explanation here. Note that on Windows you need to rename conf/spark-env.sh to conf/spark-env.cmd, since Spark will not pick up the bash script there. In that case, the environment variable is set with:

set ARROW_PRE_0_15_IPC_FORMAT=1
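If you would rather not edit conf/spark-env.sh, a commonly suggested workaround is to set the variable from Python before the session is created and forward it to the executors via spark.executorEnv. This is only a minimal sketch assuming Spark 2.4.x with PyArrow >= 0.15.0 running in local mode:

import os
from pyspark.sql import SparkSession

# Tell PyArrow >= 0.15 to use the legacy IPC format expected by Spark 2.4.x.
# Must be set before any Arrow serialization happens on the driver.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .master("local[*]")
    # Propagate the same flag to the Python workers on the executors.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)

In local mode the Python workers inherit the driver's environment anyway, so the os.environ line alone is usually enough; the spark.executorEnv setting matters once you run on a real cluster.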

Regarding "python - pyspark grouped map IllegalArgumentException error", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59546728/
