
apache-spark - Grouping data by time interval in pyspark

I am trying to group and aggregate data. Grouping by date and the other fields was straightforward, but now I also need to group by a time interval derived from [Server_Time].

EventID AccessReason    Source  Server_Date Server_Time
847495004 Granted ORSB_GND_GYM_IN 10/1/2016 7:25:52 AM
847506432 Granted ORSB_GND_GYM_IN 10/1/2016 8:53:38 AM
847512725 Granted ORSB_GND_GYM_IN 10/1/2016 10:18:50 AM
847512768 Granted ORSB_GND_GYM_IN 10/1/2016 10:19:32 AM
847513357 Granted ORSB_GND_GYM_OUT 10/1/2016 10:25:36 AM
847513614 Granted ORSB_GND_GYM_IN 10/1/2016 10:28:08 AM
847515838 Granted ORSB_GND_GYM_OUT 10/1/2016 10:57:41 AM
847522522 Granted ORSB_GND_GYM_IN 10/1/2016 11:57:10 AM

For example, I need an event count per hour. From the data above, in the 10-11 hour the source "ORSB_GND_GYM_IN" has a total count of 3 and "ORSB_GND_GYM_OUT" has a count of 2. How can I do this in pyspark?

Best answer

You can use a UDF to convert the time into an hourly range and then group by it:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def getInterval(time):
    # "10:18:50 AM" -> hour "10" and suffix "AM" -> "10-11 AM"
    start = int(time.split(":")[0])
    return str(start) + "-" + str(start + 1) + " " + time.split(" ")[1]

# Register the Python function as a UDF that returns a string
getIntervalUdf = udf(getInterval, StringType())

spark = SparkSession.builder.appName("appName").getOrCreate()

# Read the sample data (CSV file "emp" with a header row)
df = spark.read.csv("emp", sep=",", header=True)
df.show()

# Add the hourly interval column derived from Server_Time
df = df.withColumn("Interval", getIntervalUdf("Server_Time"))
df.show()

# Count events per date, hourly interval and source
df = df.groupby("Server_Date", "Interval", "Source").count()
df.show()
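
Before looking at the full output, a quick plain-Python sanity check of the interval logic, using times taken from the sample data (no Spark needed, purely illustrative):

# getInterval only uses the digits before the first ":" and the AM/PM suffix
print(getInterval("7:25:52 AM"))    # 7-8 AM
print(getInterval("10:18:50 AM"))   # 10-11 AM
print(getInterval("11:57:10 AM"))   # 11-12 AM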

Output
+-----------+--------------+------------------+-------------+-------------+
| EventID | AccessReason | Source | Server_Date | Server_Time |
+-----------+--------------+------------------+-------------+-------------+
| 847495004 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 7:25:52 AM |
| 847506432 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 8:53:38 AM |
| 847512725 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 10:18:50 AM |
| 847512768 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 10:19:32 AM |
| 847513357 | Granted | ORSB_GND_GYM_OUT | 10/1/2016 | 10:25:36 AM |
| 847513614 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 10:28:08 AM |
| 847515838 | Granted | ORSB_GND_GYM_OUT | 10/1/2016 | 10:57:41 AM |
| 847522522 | Granted | ORSB_GND_GYM_IN | 10/1/2016 | 11:57:10 AM |
+-----------+--------------+------------------+-------------+-------------+

+---------+------------+----------------+-----------+-----------+--------+
| EventID|AccessReason| Source|Server_Date|Server_Time|Interval|
+---------+------------+----------------+-----------+-----------+--------+
|847495004| Granted| ORSB_GND_GYM_IN| 10/1/2016| 7:25:52 AM| 7-8 AM|
|847506432| Granted| ORSB_GND_GYM_IN| 10/1/2016| 8:53:38 AM| 8-9 AM|
|847512725| Granted| ORSB_GND_GYM_IN| 10/1/2016|10:18:50 AM|10-11 AM|
|847512768| Granted| ORSB_GND_GYM_IN| 10/1/2016|10:19:32 AM|10-11 AM|
|847513357| Granted|ORSB_GND_GYM_OUT| 10/1/2016|10:25:36 AM|10-11 AM|
|847513614| Granted| ORSB_GND_GYM_IN| 10/1/2016|10:28:08 AM|10-11 AM|
|847515838| Granted|ORSB_GND_GYM_OUT| 10/1/2016|10:57:41 AM|10-11 AM|
|847522522| Granted| ORSB_GND_GYM_IN| 10/1/2016|11:57:10 AM|11-12 AM|
+---------+------------+----------------+-----------+-----------+--------+

+-----------+--------+----------------+-----+
|Server_Date|Interval| Source|count|
+-----------+--------+----------------+-----+
| 10/1/2016|10-11 AM| ORSB_GND_GYM_IN| 3|
| 10/1/2016| 8-9 AM| ORSB_GND_GYM_IN| 1|
| 10/1/2016|10-11 AM|ORSB_GND_GYM_OUT| 2|
| 10/1/2016|11-12 AM| ORSB_GND_GYM_IN| 1|
| 10/1/2016| 7-8 AM| ORSB_GND_GYM_IN| 1|
+-----------+--------+----------------+-----+
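
Alternatively, if Server_Date and Server_Time can be parsed into a proper timestamp, the hourly bucket can be computed with Spark's built-in functions instead of a Python UDF, which avoids per-row Python calls. A minimal sketch, assuming the same "emp" CSV as above and a Spark 3.x datetime pattern ("M/d/yyyy h:mm:ss a" matches values like "10/1/2016 7:25:52 AM"):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, hour, to_timestamp

spark = SparkSession.builder.appName("appName").getOrCreate()
df = spark.read.csv("emp", sep=",", header=True)

# Combine the date and time strings into one timestamp column
df = df.withColumn(
    "ts",
    to_timestamp(concat_ws(" ", col("Server_Date"), col("Server_Time")),
                 "M/d/yyyy h:mm:ss a"))

# hour(ts) gives the 24-hour bucket start, e.g. 10 covers 10:00-10:59
df.groupBy("Server_Date", hour("ts").alias("Hour"), "Source").count().show()

On older Spark versions the timestamp pattern handling differs (see spark.sql.legacy.timeParserPolicy), so the format string should be checked against the actual data.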

Regarding apache-spark - grouping data by time interval in pyspark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42976200/
