python - PySpark: join dataframe column based on array_contains


I have two dataframes:

sdf1 = spark.createDataFrame([
    ("123", "A", [1, 2, 3]),
    ("123", "B", [4, 5]),
    ("456", "C", [1, 2]),
    ("456", "D", [3, 4, 5]),
], ["id1", "name", "resources"])

sdf2 = spark.createDataFrame([
    ("123", 1, "R1"),
    ("123", 2, "R2"),
    ("123", 3, "R3"),
    ("123", 4, "R4"),
    ("123", 5, "R5"),
    ("456", 1, "R1"),
    ("456", 2, "R2"),
    ("456", 3, "R7"),
    ("456", 4, "R8"),
    ("456", 5, "R9")
], ["id2", "resource_id", "name"])

Expected result:

+----+-----+-----------+-------------+
|id1 |name |resources  |New Column   |
+----+-----+-----------+-------------+
|123 |A    |[1, 2, 3]  |[R1, R2, R3] |
|123 |B    |[4, 5]     |[R4, R5]     |
|456 |C    |[1, 2]     |[R1, R2]     |
|456 |D    |[3, 4, 5]  |[R7, R8, R9] |
+----+-----+-----------+-------------+

I tried it like this:

res_sdf = sdf1.join(sdf2, on=[(sdf1.id1 == sdf2.id2) & array_contains(sdf1.resources, sdf2.resource_id)], how='left')

But I get the error: TypeError: Column is not iterable

What is the correct way to do this?

Thanks!

Best Answer

Try this code. In older PySpark releases the Python array_contains() helper only accepts a literal as its second argument, so passing a Column like sdf2.resource_id raises the TypeError above; a UDF sidesteps that restriction:

from pyspark.sql.functions import udf, collect_list
from pyspark.sql.types import BooleanType

# UDF that checks whether a resource_id occurs in the resources array;
# declaring BooleanType avoids the default StringType return value.
contain_udf = udf(lambda x, y: x in y, BooleanType())

res_sdf = sdf1.join(sdf2, on=[sdf1.id1 == sdf2.id2], how='left') \
    .filter(contain_udf("resource_id", "resources"))
res_sdf = res_sdf.groupBy(sdf1.id1, sdf1.name, "resources") \
    .agg(collect_list(sdf2.name).alias("New Column")) \
    .orderBy("id1")
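Alternatively, here is a minimal sketch that avoids the Python UDF altogether: the SQL form of array_contains accepts a column as its second argument, so the containment check can go straight into the join condition via expr (this assumes the unqualified names resources and resource_id resolve unambiguously across the two dataframes, which they do here):

from pyspark.sql.functions import expr, collect_list

# The SQL array_contains expression can take a column as its second
# argument, unlike the Python array_contains() helper, so the check can
# live in the join condition and no Python UDF is needed.
res_sdf = (
    sdf1.join(
        sdf2,
        on=[(sdf1.id1 == sdf2.id2) & expr("array_contains(resources, resource_id)")],
        how="left",
    )
    .groupBy(sdf1.id1, sdf1.name, "resources")
    .agg(collect_list(sdf2.name).alias("New Column"))
    .orderBy("id1")
)

Keeping the predicate on the JVM side also avoids the serialization overhead of a Python UDF.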

Regarding python - PySpark: join dataframe column based on array_contains, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60784048/
