Pandas UDF to iterate over PySpark DataFrame rows

Reposted by: 行者123 · Updated: 2023-12-05 01:31:29

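I am trying to use pandas_udf because my data lives in a PySpark DataFrame, but I want to use a pandas-based library. I have too many rows to convert the PySpark DataFrame to a pandas DataFrame.

I use textdistance (pip3 install textdistance) and import it with import textdistance.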

test = spark.createDataFrame(
    [('dog cat', 'dog cat'), 
     ('cup dad', 'mug'),],
    ['value1', 'value2']
)

@pandas_udf('float', PandasUDFType.SCALAR)
def textdistance_jaro_winkler(a, b):
    return textdistance.jaro_winkler(a, b)

test = test.withColumn('jaro_winkler', textdistance_jaro_winkler(col('value1'), col('value2')))
test.show()

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I also tried passing the whole DataFrame to the function and reading the string values inside it, but I think that made things worse:

schema = StructType([StructField("value1", StringType(), True)
                     ,StructField("value2", StringType(), True)
                     ,StructField("jaro_winkler", FloatType(), True)
                    ])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def textdistance_jaro_winkler(df):
    df['jaro_winkler'] = df.apply(lambda x: textdistance.jaro_winkler(x['value1'],  x['value2']))
    
    return df
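
A note on this second attempt: without axis=1, DataFrame.apply hands the lambda one column at a time, so x['value1'] is a failed label lookup rather than a cell value. The sketch below reproduces that in plain pandas, without Spark. The helper similarity is a hypothetical stand-in built on difflib so the snippet runs without textdistance; it is not the Jaro-Winkler measure. In Spark itself, a grouped-map function would additionally have to be invoked through groupBy(...).apply(...) or, in Spark 3, groupBy(...).applyInPandas(...), rather than called like a column expression.

```python
import difflib

import pandas as pd

df = pd.DataFrame(
    {"value1": ["dog cat", "cup dad"], "value2": ["dog cat", "mug"]}
)


def similarity(row):
    # Hypothetical stand-in for textdistance.jaro_winkler(row["value1"], row["value2"]).
    return difflib.SequenceMatcher(None, row["value1"], row["value2"]).ratio()


# Default axis=0: each call receives a whole column (a Series indexed 0, 1),
# so looking up the label "value1" raises KeyError.
try:
    df.apply(similarity)
    column_wise_error = None
except KeyError as exc:
    column_wise_error = exc

# axis=1: each call receives one row, indexed by column name.
df["score"] = df.apply(similarity, axis=1)
```

With axis=1 the function sees one row per call, which is what the original lambda assumed.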

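Best answer

A scalar pandas UDF receives whole pandas Series (one batch of column values at a time), not individual strings, so textdistance has to be applied element by element. Rewrite the function as a Series-to-Series pandas UDF: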

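import pandas as pd
import textdistance
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

def textdistance_jaro_winkler(a: pd.Series, b: pd.Series) -> pd.Series:
    # a and b are batches of column values; compare them pair by pair
    return pd.Series([textdistance.jaro_winkler(x, y) for x, y in zip(a, b)])


# Wrap the function as a Series-to-Series pandas UDF that returns floats
jaro_winkler_udf = F.pandas_udf(textdistance_jaro_winkler, returnType=FloatType())

test = test.withColumn('jaro_winkler', jaro_winkler_udf(col('value1'), col('value2')))
test.show()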

#+-------+-------+------------+
#| value1| value2|jaro_winkler|
#+-------+-------+------------+
#|dog cat|dog cat|         1.0|
#|cup dad|    mug|   0.4920635|
#+-------+-------+------------+
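
The original error can be reproduced without Spark, which may help when debugging a UDF locally. The sketch below assumes the failure comes from a truth test on the inputs somewhere inside the scalar function; naive_equal is a hypothetical stand-in for such a function, not textdistance's actual code.

```python
import pandas as pd

# Stand-ins for the two column batches Spark hands to a scalar pandas UDF.
a = pd.Series(["dog cat", "cup dad"])
b = pd.Series(["dog cat", "mug"])


def naive_equal(x, y):
    # Hypothetical scalar-style function: fine for two strings, but on two
    # Series `x == y` is a boolean Series, which `if` cannot reduce to a
    # single True/False.
    if x == y:
        return 1.0
    return 0.0


try:
    naive_equal(a, b)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Calling the same function pair by pair, as the answer does with zip, works.
pairwise = pd.Series([naive_equal(x, y) for x, y in zip(a, b)])
```

Here error_message holds the same "truth value of a Series is ambiguous" text as in the question, while the zip-based version returns one value per row.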

Regarding Pandas UDF iterating over PySpark DataFrame rows, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66174399/
