
python - Counting consecutive occurrences of a specific value in PySpark


Alongside the Timestamp column, I have defined a column named info:

+-------------------+------+
|     Timestamp     | info |
+-------------------+------+
|2016-01-01 17:54:30|   0  |
|2016-02-01 12:16:18|   0  |
|2016-03-01 12:17:57|   0  |
|2016-04-01 10:05:21|   0  |
|2016-05-11 18:58:25|   1  |
|2016-06-11 11:18:29|   1  |
|2016-07-01 12:05:21|   0  |
|2016-08-11 11:58:25|   0  |
|2016-09-11 15:18:29|   1  |
+-------------------+------+

I want to count consecutive occurrences of 1 and write 0 otherwise. The resulting column should look like this:

+-------------------+------+------+
|     Timestamp     | info | res  |
+-------------------+------+------+
|2016-01-01 17:54:30|   0  |   0  |
|2016-02-01 12:16:18|   0  |   0  |
|2016-03-01 12:17:57|   0  |   0  |
|2016-04-01 10:05:21|   0  |   0  |
|2016-05-11 18:58:25|   1  |   1  |
|2016-06-11 11:18:29|   1  |   2  |
|2016-07-01 12:05:21|   0  |   0  |
|2016-08-11 11:58:25|   0  |   0  |
|2016-09-11 15:18:29|   1  |   1  |
+-------------------+------+------+

I tried the following, but it did not work.

df_input = df_input.withColumn(
    "res",
    F.when(
        # w1 is a window specification defined earlier (not shown in the question)
        df_input.info == F.lag(df_input.info).over(w1),
        F.sum(F.lit(1)).over(w1)
    ).otherwise(0)
)
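This snippet cannot reset the counter: F.sum(F.lit(1)).over(w1) produces a running count over the whole window, and the lag comparison only decides whether that count is shown or replaced by 0 (on the first row the comparison is null, so it always falls into the otherwise branch). One way to repair the lag-based idea is to flag the start of each run, turn the flags into a run id with a cumulative sum, and then number the rows inside each run. The following is only a minimal sketch under the assumption that w1 orders by Timestamp and df_input is the DataFrame shown above; the names change, run_id and df_res are illustrative:

from pyspark.sql import functions as F, Window

w1 = Window.orderBy("Timestamp")  # assumed ordering over the whole DataFrame

df_res = (
    df_input
    # Flag the start of each run: 1 whenever info differs from the previous row
    # (the first row has no previous value, so it is flagged as well).
    .withColumn("change",
                F.when(F.col("info") == F.lag("info").over(w1), F.lit(0))
                 .otherwise(F.lit(1)))
    # A running sum of the flags yields a distinct id per run of equal values.
    .withColumn("run_id", F.sum("change").over(w1))
    # Number the rows inside each run; rows where info is 0 keep 0.
    .withColumn("res",
                F.when(F.col("info") != 0,
                       F.row_number().over(Window.partitionBy("run_id")
                                                 .orderBy("Timestamp")))
                 .otherwise(F.lit(0)))
    .drop("change", "run_id")
)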

Best Answer

From Adding a column counting cumulative pervious repeating values, credit to @blackbishop:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([0, 0, 0, 0, 1, 1, 0, 0, 1], 'int').toDF('info')

df.withColumn("ID", F.monotonically_increasing_id()) \
    .withColumn(
        "group",
        # Gaps-and-islands: the difference of the two row numbers is constant
        # within each run of identical info values.
        F.row_number().over(Window.orderBy("ID"))
        - F.row_number().over(Window.partitionBy("info").orderBy("ID"))
    ) \
    .withColumn(
        "Result",
        # Number the rows inside each run; rows where info is 0 stay 0.
        F.when(F.col("info") != 0,
               F.row_number().over(Window.partitionBy("group").orderBy("ID")))
         .otherwise(F.lit(0))
    ) \
    .orderBy("ID") \
    .drop("ID", "group") \
    .show()

+----+------+
|info|Result|
+----+------+
|   0|     0|
|   0|     0|
|   0|     0|
|   0|     0|
|   1|     1|
|   1|     2|
|   0|     0|
|   0|     0|
|   1|     1|
+----+------+
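Applied to the original data, the same gaps-and-islands trick can order by the existing Timestamp column instead of a synthetic ID. A minimal sketch, assuming df_input holds the Timestamp and info columns from the question (w_all, w_info and df_res are illustrative names):

from pyspark.sql import functions as F, Window

w_all = Window.orderBy("Timestamp")
w_info = Window.partitionBy("info").orderBy("Timestamp")

df_res = (
    df_input
    # The difference of the two row numbers is constant within each run of equal info values.
    .withColumn("group",
                F.row_number().over(w_all) - F.row_number().over(w_info))
    # Count positions inside each run of 1s; all other rows get 0.
    .withColumn("res",
                F.when(F.col("info") != 0,
                       F.row_number().over(Window.partitionBy("group")
                                                 .orderBy("Timestamp")))
                 .otherwise(F.lit(0)))
    .drop("group")
)

df_res.orderBy("Timestamp").show()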

A similar question about counting consecutive occurrences of a specific value in PySpark can be found on Stack Overflow: https://stackoverflow.com/questions/72768076/
