
group-by - pyspark: aggregating on the most common value in a column


from pyspark.sql.functions import count, sum

aggregrated_table = df_input.groupBy('city', 'income_bracket') \
    .agg(
        count('suburb').alias('suburb'),
        sum('population').alias('population'),
        sum('gross_income').alias('gross_income'),
        sum('no_households').alias('no_households'))

I want to group by city and income bracket, but within each city some suburbs have different income brackets. How can I group by the income bracket that occurs most often in each city?

For example:

city1 suburb1 income_bracket_10 
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10

would be grouped under income_bracket_10.
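For reference, a minimal sketch of a DataFrame matching this example (the SparkSession setup and the population, gross_income and no_households values are assumptions added only so the snippet is runnable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy rows mirroring the example above; the numeric columns are invented
# purely for illustration.
df_input = spark.createDataFrame(
    [
        ('city1', 'suburb1', 'income_bracket_10', 100, 5000.0, 40),
        ('city1', 'suburb1', 'income_bracket_10', 120, 6000.0, 50),
        ('city1', 'suburb2', 'income_bracket_10', 80, 4000.0, 30),
        ('city1', 'suburb3', 'income_bracket_11', 60, 3500.0, 25),
        ('city1', 'suburb4', 'income_bracket_10', 90, 4500.0, 35),
    ],
    ['city', 'suburb', 'income_bracket', 'population', 'gross_income', 'no_households'])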

Best answer

Using a window function before aggregating might do the trick:

from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.partitionBy('city')

aggregrated_table = df_input.withColumn(
    "count",
    psf.count("*").over(w)                                # rows in the city window
).withColumn(
    "rn",
    psf.row_number().over(w.orderBy(psf.desc("count")))   # rank rows by that count, descending
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))

You can also use a window function after the aggregation, since you are keeping a count of (city, income_bracket) occurrences.
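A minimal sketch of that post-aggregation variant, assuming the same df_input columns as above (the rn helper column name is just illustrative): aggregate per (city, income_bracket) first, then keep the bracket whose row count is highest within each city.

from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.partitionBy('city')

# count('suburb') is the number of rows each income bracket had in that city,
# so ordering by it descending and keeping rank 1 retains the most common
# bracket per city.
aggregated = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households')
).withColumn(
    'rn',
    psf.row_number().over(w.orderBy(psf.desc('suburb')))
).filter('rn = 1').drop('rn')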

Regarding group-by - pyspark: aggregating on the most common value in a column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45634725/
