
python - PySpark numeric window groupBy


I would like to be able to group Spark data by a step size, not just by individual values. Is there anything in Spark similar to PySpark 2.x's window function, but for numeric (non-date) values?

Something like:

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")
res = df.groupBy(window("foo", step=2, start=10)).count()

Best answer

You can reuse the timestamp window and express the parameters in seconds. Tumbling window:

from pyspark.sql.functions import col, window

df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()

# +---+-------+
# |foo| window|
# +---+-------+
# | 10|[10,12]|
# | 11|[10,12]|
# | 12|[12,14]|
# | 13|[12,14]|
# +---+-------+
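If the bucket boundaries are needed as ordinary integer columns downstream, the struct fields can be pulled out with dot notation. A minimal sketch, assuming the same df with an integer column "foo" as above (the bucket_start/bucket_end names are only for illustration):

from pyspark.sql.functions import col, window

# Flatten the window struct into plain integer columns.
df.withColumn(
    "window",
    window(col("foo").cast("timestamp"), windowDuration="2 seconds")
        .cast("struct<start:bigint,end:bigint>")
).select(
    "foo",
    col("window.start").alias("bucket_start"),
    col("window.end").alias("bucket_end")
).show()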

Sliding window:

df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds", slideDuration="1 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()

# +---+-------+
# |foo| window|
# +---+-------+
# | 10| [9,11]|
# | 10|[10,12]|
# | 11|[10,12]|
# | 11|[11,13]|
# | 12|[11,13]|
# | 12|[12,14]|
# | 13|[12,14]|
# | 13|[13,15]|
# +---+-------+
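Note that with the sliding variant the buckets overlap, so every value of "foo" falls into two of them and a grouped count will count each row once per overlapping bucket. A small sketch of grouping on the sliding window's start (same df as above assumed):

from pyspark.sql.functions import col, window

# Overlapping buckets: each value of "foo" contributes to two of them.
sliding = window(
    col("foo").cast("timestamp"),
    windowDuration="2 seconds", slideDuration="1 seconds"
).cast("struct<start:bigint,end:bigint>")

df.groupBy(sliding.start.alias("start")).count().orderBy("start").show()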

Using groupBy and start:

w = window(col("foo").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
start = w.start.alias("start")
df.groupBy(start).count().show()

# +-----+-----+
# |start|count|
# +-----+-----+
# |   10|    2|
# |   12|    2|
# +-----+-----+
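For reference, a self-contained sketch that puts the pieces together, assuming Spark 2.x+ with a SparkSession in place of the question's SQLContext:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Build the example DataFrame from the question.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")

# Bucket by a step of 2 via the timestamp window, then count per bucket.
w = window(col("foo").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
df.groupBy(w.start.alias("start")).count().orderBy("start").show()

# Expected result: two buckets, start=10 and start=12, each with count 2.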

Regarding python - PySpark numeric window groupBy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48467215/
