
apache-spark - How to max and keep all columns (record with max per group)?

Reposted · Author: 行者123 · Updated: 2023-12-03 07:23:43

Given the following DataFrame:

+----+-----+---+-----+
| uid|    k|  v|count|
+----+-----+---+-----+
|   a|pref1|  b|  168|
|   a|pref3|  h|  168|
|   a|pref3|  t|   63|
|   a|pref3|  k|   84|
|   a|pref1|  e|   84|
|   a|pref2|  z|  105|
+----+-----+---+-----+

How do I get the max count per uid and k, while also keeping v?

+----+-----+---+----------+
| uid|    k|  v|max(count)|
+----+-----+---+----------+
|   a|pref1|  b|       168|
|   a|pref3|  h|       168|
|   a|pref2|  z|       105|
+----+-----+---+----------+

I can do something like this, but it drops the column v:

df.groupBy("uid", "k").max("count")
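
(This makes sense: an aggregation returns only the grouping columns plus the aggregates, so v cannot survive it. A quick check of the resulting columns, assuming the DataFrame above is named df:)

scala> df.groupBy("uid", "k").max("count").columns
res0: Array[String] = Array(uid, k, max(count))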

Best Answer

This is a perfect use case for a window operator (using the over function) or a join.

Since you already know how to use window functions, I'll focus exclusively on the join approach.
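
For reference, here is nonetheless a minimal window-based sketch. It assumes the same inventory DataFrame that is built just below, and the helper column name max is my own choice:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// Attach max(count) per (uid, k) to every row, so no columns are lost,
// then keep only the rows that attain their group's maximum.
val byGroup = Window.partitionBy("uid", "k")
inventory
  .withColumn("max", max("count").over(byGroup))
  .where(col("count") === col("max"))
  .drop("max")
  .show

Note that ties survive: if two rows in a group share the maximum count, both are returned. Ranking with row_number over a window ordered by count descending would keep exactly one row per group instead.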

scala> val inventory = Seq(
     |   ("a", "pref1", "b", 168),
     |   ("a", "pref3", "h", 168),
     |   ("a", "pref3", "t", 63)).toDF("uid", "k", "v", "count")
inventory: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 2 more fields]

scala> val maxCount = inventory.groupBy("uid", "k").max("count")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+----------+
|uid|    k|max(count)|
+---+-----+----------+
|  a|pref3|       168|
|  a|pref1|       168|
+---+-----+----------+

Renaming the aggregate avoids the awkward max(count) column name and gives something easy to reference in the join condition:

scala> val maxCount = inventory.groupBy("uid", "k").agg(max("count") as "max")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+---+
|uid|    k|max|
+---+-----+---+
|  a|pref3|168|
|  a|pref1|168|
+---+-----+---+

scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").show
+---+-----+---+---+-----+
|uid|    k|max|  v|count|
+---+-----+---+---+-----+
|  a|pref3|168|  h|  168|
|  a|pref1|168|  b|  168|
+---+-----+---+---+-----+
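
To match the exact output shape asked for in the question, the helper max column can be dropped after the join. A minimal continuation of the session above (row order may vary):

scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").drop("max").show
+---+-----+---+-----+
|uid|    k|  v|count|
+---+-----+---+-----+
|  a|pref3|  h|  168|
|  a|pref1|  b|  168|
+---+-----+---+-----+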

Regarding "apache-spark - How to max and keep all columns (record with max per group)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42636179/
