
python - Pyspark: Getting percentage results after groupBy


For example, here is my test data:

test = spark.createDataFrame([
(0, 1, 5, "2018-06-03", "Region A"),
(1, 1, 2, "2018-06-04", "Region B"),
(2, 2, 1, "2018-06-03", "Region B"),
(3, 3, 1, "2018-06-01", "Region A"),
(3, 1, 3, "2018-06-05", "Region A"),
])\
.toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

I can get summary data like this:

from pyspark.sql.functions import sum

test.groupBy("customerid", "location").agg(sum("price")).show()

+----------+--------+----------+
|customerid|location|sum(price)|
+----------+--------+----------+
|         1|Region B|         2|
|         1|Region A|         8|
|         3|Region A|         1|
|         2|Region B|         1|
+----------+--------+----------+

But I also want the percentage, like this:

+----------+--------+----------+----------+
|customerid|location|sum(price)|percentage|
+----------+--------+----------+----------+
|         1|Region B|         2|       20%|
|         1|Region A|         8|       80%|
|         3|Region A|         1|      100%|
|         2|Region B|         1|      100%|
+----------+--------+----------+----------+

I would like to know:

  • How can I do this? Perhaps with a window function?
  • Can I pivot the table into something like this, with both sum and percentage columns per region? (A pivot sketch follows the accepted answer below.)



I only found a pandas example, at How to get percentage of counts of a column after groupby in Pandas.

Update:

With the help of @Gordon Linoff, I can get there with:

from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window

test.groupBy("customerid", "location").agg(sum("price"))\
.withColumn("percentage", col("sum(price)")/sum("sum(price)").over(Window.partitionBy(test['customerid']))).show()
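Note that the window expression above produces a fraction (0.2, 0.8, 1.0) rather than the percent strings shown in the desired output. A minimal formatting sketch, using format_string to render strings like "20%"; the aliases t_price and perc are illustrative, not from the original:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch only: compute the per-customer fraction, then render it as a percent
# string such as "20%". The column names t_price and perc are illustrative.
(test.groupBy("customerid", "location")
    .agg(F.sum("price").alias("t_price"))
    .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))
    .withColumn("percentage", F.format_string("%d%%", F.round(F.col("perc") * 100).cast("int")))
    .show())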

Best Answer

Here is clean code for your problem:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Aggregate per customer and location, then divide each group's total by the
# customer-level total computed over a window.
(test.groupby("customerid", "location")
    .agg(F.sum("price").alias("t_price"))
    .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid")))
    .show())
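As for the pivot part of the question, a pivot can be layered on top of the aggregation above. A minimal sketch, assuming each (customerid, location) pair appears once after aggregation, so first() simply carries the already-computed values through the pivot:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch only: reuse the aggregated frame, then pivot locations into columns,
# keeping both the sum and the per-customer percentage for each region.
agg = (test.groupby("customerid", "location")
    .agg(F.sum("price").alias("t_price"))
    .withColumn("perc", F.col("t_price") / F.sum("t_price").over(Window.partitionBy("customerid"))))

# The result has columns named like "Region A_sum" and "Region A_perc".
(agg.groupBy("customerid")
    .pivot("location")
    .agg(F.first("t_price").alias("sum"), F.first("perc").alias("perc"))
    .show())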

Regarding python - Pyspark: Getting percentage results after groupBy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51964771/
