
python - How to create a UDF that creates a new column and modifies an existing column

Reposted. Author: 行者123. Updated: 2023-12-01 02:35:51

I have a DataFrame like this:

id | color
---| -----
1 | red-dark
2 | green-light
3 | red-light
4 | blue-sky
5 | green-dark

I want to create a UDF that turns my DataFrame into:

id | color | shade
---| ----- | -----
1 | red | dark
2 | green | light
3 | red | light
4 | blue | sky
5 | green | dark

I wrote a UDF for this:

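from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    # Replaces "-" with ",", e.g. "red-dark" -> "red,dark"
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())

# Apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))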

However, this does not transform the DataFrame the way I expected. Instead, it produces:

id | color      | shade
---| ---------- | -----
1 | red-dark | red,dark
2 | green-light | green,light
3 | red-light | red,light
4 | blue-sky | blue,sky
5 | green-dark | green,dark

How can I transform the DataFrame the way I want in pyspark?

Attempt based on the suggested question:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))

This gives the following:

id | color | colorshade
---| ---------- | -----
1 | red-dark | [{"color":"red","shade":"dark"}]
2 | green-light | [{"color":"green","shade":"light"}]
3 | red-light | [{"color":"red","shade":"light"}]
4 | blue-sky | [{"color":"blue","shade":"sky"}]
5 | green-dark | [{"color":"green","shade":"dark"}]

I feel like I'm getting closer.

Best answer

You can use the built-in function split():

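from pyspark.sql.functions import split, col

# split() takes a regex pattern; "\\-" matches a literal hyphen.
# The select() keeps only id, color and shade, so "arr" needs no separate drop().
df.withColumn("arr", split(df.color, "\\-")) \
    .select("id",
            col("arr")[0].alias("color"),
            col("arr")[1].alias("shade")) \
    .show()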
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+
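
If a UDF is specifically required, as the question asks, one possible approach is to have the UDF return a single struct (rather than an array of structs, as in the attempt above) and then pull its fields out into columns. The following is only a sketch under assumptions: the names `split_color`, `add_color_and_shade` and the temporary column `parts` are hypothetical and not from the original answer, and the handling of None and of values without a hyphen is a design choice made here. The built-in split() above is usually preferable because it avoids Python UDF overhead.

```python
from typing import Optional, Tuple


def split_color(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'red-dark' into ('red', 'dark').

    Assumption: only the first hyphen separates color from shade.
    None input, or a value with no shade part, yields None for the missing piece.
    """
    if value is None:
        return (None, None)
    color, _, shade = value.partition("-")
    return (color, shade or None)


def add_color_and_shade(df):
    """Overwrite `color` and add `shade` via a struct-returning UDF (hypothetical helper)."""
    # Imported lazily so split_color can be tried out without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    split_udf = udf(split_color, schema)

    # Compute the struct once in a temporary column, then unpack its fields.
    with_parts = df.withColumn("parts", split_udf(col("color")))
    return (with_parts
            .withColumn("color", col("parts.color"))
            .withColumn("shade", col("parts.shade"))
            .drop("parts"))
```

With the DataFrame from the question, `add_color_and_shade(df).show()` should give the same id, color and shade columns as the split() answer, while keeping any other columns `df` happens to have.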

Regarding "python - How to create a UDF that creates a new column and modifies an existing column", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46239949/
