
python-2.7 - How to create a UDF with two inputs in pyspark


I'm new to pyspark and I'm trying to create a simple UDF that takes two input columns, checks whether the second column is empty, and if so splits the first column into two values and overwrites the original columns. This is what I did:

def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

udf_split = udf(split, ArrayType())

df = df \
    .withColumn("x", udf_split(df['x'], df['y'])[1]) \
    .withColumn("y", udf_split(df['x'], df['y'])[0])

But when I run this code, I get the following error:

File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

What am I doing wrong?

Thank you, Álvaro

Best Answer

I'm not sure exactly what you are trying to do, but based on my understanding, here is how I would do it:

from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

schema = StructType([StructField("x1", StringType(), False), StructField("y1", StringType(), False)])
udf_split = udf(split, schema)
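
As a side note, the TypeError in the question most likely comes from ArrayType(): in PySpark, ArrayType requires an element type as its first argument, so calling it with no arguments fails inside __init__. A minimal sketch of the array-returning variant, reusing the split function defined above and assuming both returned values are strings:

# ArrayType needs an element type; a bare ArrayType() raises the TypeError shown in the question.
udf_split_arr = udf(split, ArrayType(StringType()))
# Elements of the returned array would then be accessed by index, e.g. udf_split_arr(col("x"), col("y"))[0].

The StructType-based version below is usually more convenient here, because the two output fields get names instead of positions.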

df = spark.createDataFrame([("EXDRA", ""), ("EXIZQ", ""), ("", "foo")], ("x", "y"))

df.show()

# +-----+---+
# | x| y|
# +-----+---+
# |EXDRA| |
# |EXIZQ| |
# | |foo|
# +-----+---+

df = df \
    .withColumn("split", udf_split(df['x'], df['y'])) \
    .withColumn("x", col("split.x1")) \
    .withColumn("y", col("split.y1"))

df.printSchema()

# root
# |-- x: string (nullable = true)
# |-- y: string (nullable = true)
# |-- split: struct (nullable = true)
# | |-- x1: string (nullable = false)
# | |-- y1: string (nullable = false)


df.show()

# +----+----+----------+
# | x| y| split|
# +----+----+----------+
# | EXT|DCHA|[EXT,DCHA]|
# | EXT|IZDA|[EXT,IZDA]|
# |null|null| null|
# +----+----+----------+
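
Two follow-up notes on this approach: the intermediate split column can be dropped once x and y have been overwritten, and because split() returns None when neither condition matches, unmatched rows (like the third one above) end up with null values. A hedged sketch that keeps the original values for unmatched rows and drops the helper column, reusing schema, spark, udf and col from the snippets above (the fallback branch is an assumption, not part of the original question):

def split_keep(x, y):
    # Same mapping as above, but fall back to the original values
    # so rows that match neither condition are left untouched.
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")
    return (x, y)

udf_split_keep = udf(split_keep, schema)

df2 = spark.createDataFrame([("EXDRA", ""), ("EXIZQ", ""), ("", "foo")], ("x", "y")) \
    .withColumn("split", udf_split_keep(col("x"), col("y"))) \
    .withColumn("x", col("split.x1")) \
    .withColumn("y", col("split.y1")) \
    .drop("split")  # remove the helper column once x and y are rewritten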

Regarding "python-2.7 - How to create a UDF with two inputs in pyspark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45029113/
