python - PySpark window function: multiple conditions in orderBy on rangeBetween/rowsBetween

Reposted. Author: 太空狗. Updated: 2023-10-29 21:54:08

Is it possible to create a window function with rangeBetween or rowsBetween that can have multiple conditions in orderBy? Suppose I have a data frame like the one below.

user_id     timestamp               date        event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0
0040b5f0 2018-01-25 18:55:08 2018-01-25 1
0040b5f0 2018-01-25 18:56:17 2018-01-25 1
0040b5f0 2018-01-25 20:51:43 2018-01-25 1
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1

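For each row, I need the sum of the event column values over rows whose date is no more than 3 days earlier. However, I must not include events that happened later on the same day. I can create a window function such as: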

from pyspark.sql import Window, functions as F

days = lambda i: i * 86400  # number of seconds in i days
my_window = Window\
    .partitionBy(["user_id"])\
    .orderBy(F.col("date").cast("timestamp").cast("long"))\
    .rangeBetween(-days(3), 0)

But this would also include events that happened later on the same day. I need a window function that behaves like this (for the row marked with *):

user_id     timestamp               date        event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1----|==============|
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 sum here all events
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 only within 3 days
* 0040b5f0 2018-01-25 18:56:17 2018-01-25 1----| |
0040b5f0 2018-01-25 20:51:43 2018-01-25 1===================|
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1

I tried to create something like this:

days = lambda i: i * 86400
my_window = Window\
    .partitionBy(["user_id"])\
    .orderBy(F.col("date").cast("timestamp").cast("long"))\
    .rangeBetween(-days(3), Window.currentRow)\
    .orderBy(F.col("t_stamp"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

But it only reflects the last orderBy.

The resulting table should look like this:

user_id     timestamp               date        event   event_last_3d
0040b5f0 2018-01-22 13:04:32 2018-01-22 1 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 1
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 2
0040b5f0 2018-01-25 18:56:17 2018-01-25 1 3
0040b5f0 2018-01-25 20:51:43 2018-01-25 1 4
0040b5f0 2018-01-31 07:48:43 2018-01-31 1 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0 1
0040b5f0 2018-02-02 09:40:58 2018-02-02 1 2
0040b5f0 2018-02-02 09:41:01 2018-02-02 0 2
0040b5f0 2018-02-05 14:03:27 2018-02-05 1 2

I have been stuck on this problem for a while, and I would appreciate any suggestions on how to approach it.
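
To pin down the requirement, the expected event_last_3d column in the question can be reproduced with a brute-force reference in plain Python (no Spark). This is only a sketch for checking results on small data, not a scalable solution. The names ROWS and event_last_3d are made up for this illustration, and it assumes that the date column is simply the calendar date of the timestamp and that rows sharing the exact same timestamp count toward each other.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (user_id, timestamp, event)
ROWS = [
    ("0040b5f0", "2018-01-22 13:04:32", 1),
    ("0040b5f0", "2018-01-22 13:04:35", 0),
    ("0040b5f0", "2018-01-25 18:55:08", 1),
    ("0040b5f0", "2018-01-25 18:56:17", 1),
    ("0040b5f0", "2018-01-25 20:51:43", 1),
    ("0040b5f0", "2018-01-31 07:48:43", 1),
    ("0040b5f0", "2018-01-31 07:48:48", 0),
    ("0040b5f0", "2018-02-02 09:40:58", 1),
    ("0040b5f0", "2018-02-02 09:41:01", 0),
    ("0040b5f0", "2018-02-05 14:03:27", 1),
]


def event_last_3d(rows, max_days=3):
    """For each row, sum the events of the same user that are
    (a) dated at most max_days calendar days earlier, and
    (b) not later than the row's own timestamp."""
    parsed = [
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), event)
        for user, ts, event in rows
    ]
    totals = []
    for user, ts, _ in parsed:
        earliest_date = ts.date() - timedelta(days=max_days)
        totals.append(sum(
            e for u, t, e in parsed
            if u == user and t.date() >= earliest_date and t <= ts
        ))
    return totals


print(event_last_3d(ROWS))  # [1, 1, 2, 3, 4, 1, 1, 2, 2, 2]
```

The printed list matches the event_last_3d column of the expected table row by row, so any Spark window specification can be checked against it.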

Best answer

I have written an equivalent in Scala that meets your requirement. Converting it to Python should not be difficult:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val DAY_SECS = 24*60*60 // Seconds in a day
// Given a timestamp in seconds, returns the seconds equivalent of 00:00:00 (UTC) of that date
val trimToDateBoundary = (d: Long) => (d / DAY_SECS) * DAY_SECS
// Using 4 for the range here: covering the 3 previous days is, date-wise inclusive, 4 days.
// Caution: Window.currentRow is the constant 0, so trimToDateBoundary(Window.currentRow) is 0
// and the lower bound is simply -4 * DAY_SECS. The frame is therefore a sliding window from
// (current TS - 4 days) to the current TS; it is not aligned to 00:00:00 of an earlier date.
val wSpec = Window.partitionBy("user_id").
  orderBy(col("timestamp").cast("long")).
  rangeBetween(trimToDateBoundary(Window.currentRow) - (4 * DAY_SECS), Window.currentRow)
df.withColumn("sum", sum(col("event")) over wSpec).show()

Here is the output when applied to your data:

+--------+--------------------+--------------------+-----+---+
| user_id| timestamp| date|event|sum|
+--------+--------------------+--------------------+-----+---+
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...| 1.0|1.0|
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...| 0.0|1.0|
|0040b5f0|2018-01-25 18:55:...|2018-01-25 00:00:...| 1.0|2.0|
|0040b5f0|2018-01-25 18:56:...|2018-01-25 00:00:...| 1.0|3.0|
|0040b5f0|2018-01-25 20:51:...|2018-01-25 00:00:...| 1.0|4.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...| 1.0|1.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...| 0.0|1.0|
|0040b5f0|2018-02-02 09:40:...|2018-02-02 00:00:...| 1.0|2.0|
|0040b5f0|2018-02-02 09:41:...|2018-02-02 00:00:...| 0.0|2.0|
|0040b5f0|2018-02-05 14:03:...|2018-02-05 00:00:...| 1.0|2.0|
+--------+--------------------+--------------------+-----+---+

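I did not use the "date" column, and I am not sure how your requirement could be met while taking it into account. So if the date of the TS can differ from the date column, this solution does not cover that case.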

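Note: an overload of rangeBetween that accepts Column arguments, which can be date/timestamp typed columns, was introduced in Spark 2.3.0. So this solution could probably be made more elegant.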

Regarding "python - PySpark window function: multiple conditions in orderBy on rangeBetween/rowsBetween", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48688780/
