
Pyspark - window function range between date offsets

Reposted — Author: 行者123 · Updated: 2023-12-04 12:06:39

Given the sample data below, I want to compute, for each row, the number of rows seen in the past 3 days with the same "var1" value.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
from pyspark.sql.window import Window

_schema = StructType([StructField("date", StringType(), True),
                      StructField("var1", IntegerType(), True),
                      StructField("var2", StringType(), True)])

test_list = [('2017-01-30', 123, 'A'),
             ('2017-01-17', 123, 'B'),
             ('2017-01-15', 123, 'A'),
             ('2017-01-15', 123, 'A'),
             ('2017-01-14', 123, 'A'),
             ('2017-01-11', 123, 'B'),
             ('2017-01-29', 456, 'A'),
             ('2017-01-22', 789, 'B'),
             ('2017-01-21', 789, 'B'),
             ('2017-01-20', 789, 'A'),
             ('2017-01-19', 789, 'A')]

df = sqlContext.createDataFrame(test_list, schema=_schema)
df = df.withColumn('date', df.date.cast(DateType()))

I'm not sure how to set up rangeBetween so that the frame includes only rows where the same var1 value (e.g. 123) occurs and whose date falls within the 3 days before the current row, excluding the current date:
wSpec1=Window.partitionBy('var1').orderBy('date').rangeBetween(-3,-1)

df.withColumn("events_past_3days",F.count(df.var2).over(wSpec1))

This gives me an error that is beyond my experience:
AnalysisException: u'Window specification windowspecdefinition(var1#368, date#374 ASC, RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING) is not valid because The data type of the expression in the ORDER BY clause should be a numeric type.;;
Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#856L]
+- Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#856L, events_past_3days#856L]
   +- Window [count(var2#369) windowspecdefinition(var1#368, date#374 ASC, RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING) AS events_past_3days#856L], [var1#368], [date#374 ASC]
      +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379]
         +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#641L]
            +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#641L, events_past_3days#641L]
               +- Window [count(var2#369) windowspecdefinition(var1#368, dayssinceJan11900#379 ASC, RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING) AS events_past_3days#641L], [var1#368], [dayssinceJan11900#379 ASC]
                  +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379]
                     +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#424L]
                        +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379, events_past_3days#424L, events_past_3days#424L]
                           +- Window [count(var2#369) windowspecdefinition(var1#368, dayssinceJan11900#379 ASC, RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING) AS events_past_3days#424L], [var1#368], [dayssinceJan11900#379 ASC]
                              +- Project [date#374, var1#368, var2#369, dayssinceJan11900#379]
                                 +- Project [date#374, var1#368, var2#369, datediff(date#374, -25567) AS dayssinceJan11900#379]
                                    +- Project [cast(date#367 as date) AS date#374, var1#368, var2#369]
                                       +- LogicalRDD [date#367, var1#368, var2#369]
'

Best answer

One solution I found is to create a numeric day offset and use that number in rangeBetween. I'd like to know whether anyone has another approach?

# add this to have a numeric column to use below
from datetime import date

df = df.withColumn('dayssinceJan11900',
                   F.datediff(df.date, F.lit(date(1900, 1, 1))))

wSpec1 = Window.partitionBy('var1').orderBy('dayssinceJan11900').rangeBetween(-3, -1)

Regarding "Pyspark - window function range between date offsets", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49331103/
