
python - Grouping and filtering a PySpark dataframe


I have a PySpark dataframe with 3 columns. Some rows share the same values in two of the columns but differ in the third; see the example below.

first_name | last_name | requests_ID
-----------|-----------|-------------
Joe        | Smith     | [2,3]
Joe        | Smith     | [2,3,5,6]
Jim        | Bush      | [9,7]
Jim        | Bush      | [21]
Sarah      | Wood      | [2,3]

I want to group the rows on the {first_name, last_name} columns and keep only the row with the longest requests_ID array. So the result should be:

first_name | last_name | requests_ID
-----------|-----------|-------------
Joe        | Smith     | [2,3,5,6]
Jim        | Bush      | [9,7]
Sarah      | Wood      | [2,3]

I have tried several things, such as the following, but it gives me a nested array of both rows in each group instead of the longest one:

gr_df = filtered_df.groupBy("first_name", "last_name").agg(F.collect_set("requests_ID").alias("requests_ID")) 

This is the result I get:

first_name | last_name | requests_ID
-----------|-----------|-------------------
Joe        | Smith     | [[2,3],[2,3,5,6]]
Jim        | Bush      | [[9,7],[21]]
Sarah      | Wood      | [[2,3]]

Best Answer

You can use size to get the length of the array column and combine it with a window, as follows:

Import the required modules and create a sample DataFrame:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([('Joe', 'Smith', [2, 3]),
                            ('Joe', 'Smith', [2, 3, 5, 6]),
                            ('Jim', 'Bush', [9, 7]),
                            ('Jim', 'Bush', [21]),
                            ('Sarah', 'Wood', [2, 3])],
                           ('first_name', 'last_name', 'requests_ID'))

Define a window that assigns a row number within each {first_name, last_name} group, ordered by the length of the requests_ID array in descending order.

Here, f.size("requests_ID") returns the length of the requests_ID array, and desc() sorts by that length in descending order.

w_spec = Window().partitionBy("first_name", "last_name").orderBy(f.size("requests_ID").desc())

Apply the window function and keep only the first row of each group:

df.withColumn("rn", f.row_number().over(w_spec)).where("rn == 1").drop("rn").show()
+----------+---------+------------+
|first_name|last_name| requests_ID|
+----------+---------+------------+
| Jim| Bush| [9, 7]|
| Sarah| Wood| [2, 3]|
| Joe| Smith|[2, 3, 5, 6]|
+----------+---------+------------+
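
As an aside (not part of the original answer), the same result can be obtained without a window by aggregating directly: Spark compares structs field by field, so taking max over a (size, array) struct keeps the longest array per group. A minimal sketch, assuming the same df and imports as above:

# Sketch of a window-free alternative (assumes df and f from above).
# Structs compare lexicographically, so placing the array size first
# makes max() select the longest requests_ID within each group.
result = (df.groupBy("first_name", "last_name")
            .agg(f.max(f.struct(f.size("requests_ID").alias("sz"),
                                f.col("requests_ID").alias("requests_ID")))
                 .alias("m"))
            .select("first_name", "last_name", "m.requests_ID"))
result.show()

On PySpark 3.3+ the same idea can be written more directly with f.max_by("requests_ID", f.size("requests_ID")).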

For python - Grouping and filtering a PySpark dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58240769/
