
scala - GroupBy and aggregate not preserving Spark SQL sort order?


I am using Spark 2.1.

If I run the following example:

val seq = Seq((123,"2016-01-01","1"),(123,"2016-01-02","2"),(123,"2016-01-03","3"))

val df = seq.toDF("id","date","score")

val dfAgg = df.sort("id","date").groupBy("id").agg(last("score"))

dfAgg.show
dfAgg.show
dfAgg.show
dfAgg.show
dfAgg.show

The output of the above code is:
+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 1|
+---+------------------+

+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 2|
+---+------------------+

+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 1|
+---+------------------+

+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 3|
+---+------------------+

+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 3|
+---+------------------+

The intent is to get the score associated with the latest date for each id:
+---+------------------+
| id|last(score, false)|
+---+------------------+
|123| 3|
+---+------------------+

But this clearly does not work, because the result is non-deterministic. Do we have to use a window function to achieve this?

Best Answer

Looking at the documentation for org.apache.spark.sql.catalyst.expressions.aggregate.Last:

/**
* Returns the last value of `child` for a group of rows. If the last value of `child`
* is `null`, it returns `null` (respecting nulls). Even if [[Last]] is used on an already
* sorted column, if we do partial aggregation and final aggregation (when mergeExpression
* is used) its result will not be deterministic (unless the input table is sorted and has
* a single partition, and we use a single reducer to do the aggregation.).
*/

This indicates that, unfortunately, the behaviour above is expected.

So to answer my own question: a window function, as described in SPARK DataFrame: select the first row of each group, currently looks like the best way forward.
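
For reference, below is a minimal sketch of that window-function approach, assuming the df defined in the question (the names byIdLatestDate and dfLatest are illustrative, not part of the original code):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank the rows within each id by date, newest first.
val byIdLatestDate = Window.partitionBy("id").orderBy(col("date").desc)

// Keep only the top-ranked (most recent) row per id, then drop the helper column.
val dfLatest = df
  .withColumn("rn", row_number().over(byIdLatestDate))
  .where(col("rn") === 1)
  .drop("rn")

dfLatest.show  // expected: a single row (123, 2016-01-03, "3")

Unlike last() applied after a separate sort, the ordering here is part of the window specification itself, so the result no longer depends on how the data happens to be partitioned.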

Regarding scala - GroupBy and aggregate not preserving Spark SQL sort order?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44267153/
