
python - Spark pivot one column but keep the other columns intact


Given the following data frame, how can I pivot the maximum Score while aggregating the sum of Plays?

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window

sc = SparkContext()
sqlContext = HiveContext(sc)  # Spark 1.x entry point; in Spark 2.x+ use SparkSession instead

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0, 1),
    ("u1", "g3", 2, 2, 1),
    ("u1", "g3", 5, 3, 1),
    ("u1", "g4", 5, 4, 1),
    ("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])

Desired output:

+------+-------------+-------------+-----+
|UserID|MaxScoreGame1|MaxScoreGame2|Plays|
+------+-------------+-------------+-----+
|    u1|           10|            5|    4|
|    u2|            1|         null|    1|
+------+-------------+-------------+-----+

I have posted a solution below, but I would like to avoid using a join (a sketch of that kind of join-based approach follows).
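The asker's join-based solution does not appear on this page. For reference, a minimal sketch of that style of approach, assuming the obvious pivot-then-join shape rather than the asker's exact code, might look like this:

# Hypothetical reconstruction of the pivot-then-join approach being avoided.
# The window below is an assumption: one row number per game, per user.
rowNumberWindow = Window.partitionBy("UserID").orderBy("GameID")

scores = (df
    # collapse duplicate (UserID, GameID) rows so each game gets one row
    .groupBy("UserID", "GameID")
    .agg(F.max("Score").alias("Score"))
    .withColumn("GameNumber", F.row_number().over(rowNumberWindow))
    .withColumn("GameCol",
                F.concat(F.lit("MaxScoreGame"), F.col("GameNumber").cast("string")))
    .groupBy("UserID")
    .pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
    .agg(F.max("Score")))

plays = df.groupBy("UserID").agg(F.sum("Plays").alias("Plays"))

result = scores.join(plays, "UserID")  # the join the asker wants to avoid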

Best Answer

I don't think this is really an improvement, but you can add the total number of plays:

...
.select(
    F.col("*"),
    F.row_number().over(rowNumberWindow).alias("GameNumber"),
    F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays")
)
...
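Note that rowNumberWindow is not defined in the answer; it presumably refers to the per-user window from the asker's solution (assumed above to be Window.partitionBy("UserID").orderBy("GameID")). The detail worth noting is the empty orderBy():

# An ordered window defaults to a running frame (unbounded preceding ..
# current row), so sum("Plays") over rowNumberWindow would be a cumulative
# sum. Calling .orderBy() with no columns drops the ordering, widening the
# frame to the whole partition, so every row carries the per-user total.
totalWindow = rowNumberWindow.orderBy()  # behaves like Window.partitionBy("UserID")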

and later use it as a secondary grouping column for the pivot:

...
.groupBy("UserID", "total_plays")
.pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
.agg(F.max("Score"))
...
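Putting both fragments together, a complete version of this approach (a sketch: the pre-aggregation and the GameCol construction are carried over from the assumed solution above, not from the original answer) could be:

result = (df
    # collapse duplicate (UserID, GameID) rows so each game gets one row number
    .groupBy("UserID", "GameID")
    .agg(F.max("Score").alias("Score"), F.sum("Plays").alias("Plays"))
    .select(
        F.col("*"),
        F.row_number().over(rowNumberWindow).alias("GameNumber"),
        F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays")
    )
    .withColumn("GameCol",
                F.concat(F.lit("MaxScoreGame"), F.col("GameNumber").cast("string")))
    .groupBy("UserID", "total_plays")
    .pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
    .agg(F.max("Score"))
    .withColumnRenamed("total_plays", "Plays")
    .select("UserID", "MaxScoreGame1", "MaxScoreGame2", "Plays"))

result.show()
# +------+-------------+-------------+-----+
# |UserID|MaxScoreGame1|MaxScoreGame2|Plays|
# +------+-------------+-------------+-----+
# |    u1|           10|            5|    4|
# |    u2|            1|         null|    1|
# +------+-------------+-------------+-----+

This avoids the join: the window sum turns the per-user plays total into an ordinary column, so it can ride along as a grouping key through the pivot.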

Regarding python - Spark pivot one column but keep the other columns intact, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38232249/
