dataframe - How to concatenate two arrays in pyspark

Reposted. Author: 行者123. Updated: 2023-12-01 21:54:45

I have a pyspark dataframe.

Example:

ID   |    phone   |  name <array>  | age <array>
-------------------------------------------------
12 | 827556 | ['AB','AA'] | ['CC']
-------------------------------------------------
45 | 87346 | null | ['DD']
-------------------------------------------------
56 | 98356 | ['FF'] | null
-------------------------------------------------
34 | 87345 | ['AA','BB'] | ['BB']

I want to concatenate the two array columns name and age. This is what I tried:

from pyspark.sql import functions as F

df = df.withColumn("new_column", F.concat(df.name, df.age))
df = df.select("ID", "phone", "new_column")

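But some values are missing from the result. It seems the concat function works on String columns rather than arrays. I also need duplicates removed:

Expected result: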

ID   |    phone   |  new_column <array>  
----------------------------------------
12 | 827556 | ['AB','AA','CC']
----------------------------------------
45 | 87346 | ['DD']
----------------------------------------
56 | 98356 | ['FF']
----------------------------------------
34 | 87345 | ['AA','BB']
----------------------------------------

Given that I am using Spark version < 2.4, how can I concatenate 2 arrays in pyspark?

Thanks

Best answer

You can also use selectExpr

testdata = [(0, ['AB','AA'],  ['CC']), (1, None, ['DD']), (2,  ['FF'] ,None), (3,  ['AA','BB'] , ['BB'])]
df = spark.createDataFrame(testdata, ['id', 'name', 'age'])

>>> df.show()
+---+--------+----+
| id|    name| age|
+---+--------+----+
|  0|[AB, AA]|[CC]|
|  1|    null|[DD]|
|  2|    [FF]|null|
|  3|[AA, BB]|[BB]|
+---+--------+----+

>>> df.selectExpr('''array(concat_ws(',',name,age)) as joined''').show()
+----------+
|    joined|
+----------+
|[AB,AA,CC]|
|      [DD]|
|      [FF]|
|[AA,BB,BB]|
+----------+
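
Note that array(concat_ws(...)) wraps one comma-joined string in an array, so each row above holds a single element such as "AB,AA,CC" rather than three separate elements. Duplicates are also kept (see the last row). For the expected result in the question, one option that should work on Spark < 2.4 is a Python UDF. The following is only a sketch, not part of the original answer: the helper names merge_unique and with_merged_column are made up for illustration, and it assumes both columns are arrays of strings.

```python
def merge_unique(first, second):
    """Concatenate two lists, treating None as an empty list and
    dropping duplicates while keeping first-seen order."""
    merged = []
    for value in (first or []) + (second or []):
        if value not in merged:
            merged.append(value)
    return merged


def with_merged_column(df, left="name", right="age", out="new_column"):
    """Hypothetical helper: add `out` as the null-safe, de-duplicated
    concatenation of the array columns `left` and `right`."""
    # pyspark is imported here so merge_unique can be tried without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Assumption: both input columns are array<string>.
    merge_udf = F.udf(merge_unique, ArrayType(StringType()))
    return df.withColumn(out, merge_udf(F.col(left), F.col(right)))
```

With the test dataframe above, with_merged_column(df).select("id", "new_column").show() would display the merged column. A Python UDF is slower than built-in functions because rows are serialized to the Python worker, so this is a trade-off made for correctness on older Spark versions.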

Regarding dataframe - How to concatenate two arrays in pyspark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58604466/
