gpt4 book ai didi

python - 连续行之间的日期差异 - Pyspark Dataframe

转载 作者:太空狗 更新时间:2023-10-29 21:33:45 24 4
gpt4 key购买 nike

我有一个具有以下结构的表

USER_ID     Tweet_ID                 Date
1 1001 Thu Aug 05 19:11:39 +0000 2010
1 6022 Mon Aug 09 17:51:19 +0000 2010
1 1041 Sun Aug 19 11:10:09 +0000 2010
2 9483 Mon Jan 11 10:51:23 +0000 2012
2 4532 Fri May 21 11:11:11 +0000 2012
3 4374 Sat Jul 10 03:21:23 +0000 2013
3 4334 Sun Jul 11 04:53:13 +0000 2013

基本上我想做的是有一个 PysparkSQL 查询,它计算具有相同 user_id 号的连续记录的日期差异(以秒为单位)。预期结果将是:

1      Sun Aug 19 11:10:09 +0000 2010 - Mon Aug 09 17:51:19 +0000 2010     839930
1 Mon Aug 09 17:51:19 +0000 2010 - Thu Aug 05 19:11:39 +0000 2010 340780
2 Fri May 21 11:11:11 +0000 2012 - Mon Jan 11 10:51:23 +0000 2012 1813212
3 Sun Jul 11 04:53:13 +0000 2013 - Sat Jul 10 03:21:23 +0000 2013 5510

最佳答案

另一种方式是:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

df.withColumn("time_intertweet",(df.date.cast("bigint") - lag(df.date.cast("bigint"), 1)
.over(Window.partitionBy("user_‌​id")
.orderBy("date")‌​))
.cast("bigint"))

关于python - 连续行之间的日期差异 - Pyspark Dataframe,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/38156367/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com