apache-spark - How to group and order using lag and lead in Spark


I am using:

dataset.withColumn("lead", lead(dataset.col(start_date), 1).over(orderBy(start_date)));
I just want to add grouping by trackId, so that lead works within each group, like any aggregate function:
+---------+------------+----------+----------+
| trackId | start_time | end_time | lead     |
+---------+------------+----------+----------+
| 1       | 12:00:00   | 12:04:00 | 12:05:00 |
| 1       | 12:05:00   | 12:08:00 | 12:20:00 |
| 1       | 12:20:00   | 12:22:00 | null     |
| 2       | 13:00:00   | 13:04:00 | 13:05:00 |
| 2       | 13:05:00   | 13:08:00 | 13:20:00 |
| 2       | 13:20:00   | 13:22:00 | null     |
+---------+------------+----------+----------+

Any help on how to do this?

Best Answer

All you are missing is the Window object and a partitionBy method call:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

dataset.withColumn("lead", lead(col("start_time"), 1).over(Window.partitionBy("trackId").orderBy("start_time")))
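
For completeness, here is a minimal self-contained sketch that reproduces the sample data from the question and applies the window above. The object name LeadExample and the local Spark session are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object LeadExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for demonstration purposes only.
    val spark = SparkSession.builder().appName("lead-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample rows taken from the table in the question; times kept as strings for simplicity.
    val dataset = Seq(
      (1, "12:00:00", "12:04:00"),
      (1, "12:05:00", "12:08:00"),
      (1, "12:20:00", "12:22:00"),
      (2, "13:00:00", "13:04:00"),
      (2, "13:05:00", "13:08:00"),
      (2, "13:20:00", "13:22:00")
    ).toDF("trackId", "start_time", "end_time")

    // partitionBy restarts the window for each trackId; orderBy sorts rows within a partition.
    val w = Window.partitionBy("trackId").orderBy("start_time")

    // lead(col, 1) returns the next row's value within the partition, or null on the last row.
    dataset.withColumn("lead", lead(col("start_time"), 1).over(w)).show()

    spark.stop()
  }
}

lag(col("start_time"), 1) over the same window would return the previous row's start_time instead, with null on the first row of each partition.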

Regarding apache-spark - How to group and order using lag and lead in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50113504/
