
Python PySpark : Count Number of Rows by Week w/Week Starting on Monday and Ending on Sunday


I have a DataFrame with the following columns:

ID  Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018

I want to count the total number of IDs per week. Specifically, I want each week to always start on Monday and always end on Sunday.

I have already accomplished this in a Jupyter Notebook with pandas:

weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0') 
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
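
For reference, a minimal self-contained version of this pandas approach might look like the sketch below; size() stands in for the count_row helper column, which is assumed to already exist in the original notebook:

import pandas as pd

# Sample data matching the question (assumed, for illustration only).
df = pd.DataFrame({
    'ID': [241, 423, 126, 123, 132, 143],
    'Scheduled Date': pd.to_datetime(['10/9/2018', '9/25/2018', '9/30/2018',
                                      '8/13/2018', '8/16/2018', '10/6/2018'])
})

# Weekly bins that start on Monday; size() counts the rows in each bin.
weekly_count_output = (df.resample('W-MON', on='Scheduled Date', label='left', closed='left')
                         .size()
                         .reset_index(name='Total Count'))
weekly_count_output = weekly_count_output[weekly_count_output['Total Count'] > 0]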

However, I don't know how to write the above in PySpark syntax. I would like the resulting output to look like this:

Scheduled Date  Total Count
8/13/2018       2
9/24/2018       2
10/1/2018       1
10/8/2018       1

Note that Scheduled Date is always a Monday (the start of the week), and Total Count covers that week from Monday through Sunday.

Best Answer

Thanks to Get Last Monday in Spark for the definition of the previous_day function.

First, the imports:

from pyspark.sql.functions import *
from datetime import datetime

Assume your input data is the same as in my df (DataFrame):

cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]

df = spark.createDataFrame(vals, cols)

Here is the function definition:

def previous_day(date, dayOfWeek):
    # next_day finds the first dayOfWeek strictly after `date`; subtracting 7 days
    # gives the most recent dayOfWeek on or before `date` (the date itself if it matches).
    return date_sub(next_day(date, dayOfWeek), 7)

# Parse the MM/dd/yyyy string and reformat it as a yyyy-MM-dd date string.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))

df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241|    2018-10-09|
|423|    2018-09-25|
|126|    2018-09-30|
|123|    2018-08-13|
|132|    2018-08-16|
|143|    2018-10-06|
+---+--------------+
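
As an aside, on Spark 2.2+ the same parsing can likely be done more directly with to_date, which yields a proper DateType column instead of a formatted string (a sketch, not part of the original answer; the steps below keep the string-based version):

# Parse the MM/dd/yyyy string straight into a DateType column.
df = df.withColumn('scheduled_date', to_date(col('scheduled_date'), 'MM/dd/yyyy'))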

# Map each date to the Monday that starts its week.
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))

df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241|    2018-10-08|
|423|    2018-09-24|
|126|    2018-09-24|
|123|    2018-08-13|
|132|    2018-08-13|
|143|    2018-10-01|
+---+--------------+

# Group by the week's Monday and count the ids per week.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')

# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))

df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
|    08/13/2018|        2|
|    09/24/2018|        2|
|    10/01/2018|        1|
|    10/08/2018|        1|
+--------------+---------+
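
Two optional refinements, sketched here on top of the answer's DataFrames and not part of the original answer: aliasing the aggregate gives the Total Count column name from the question, and on Spark 2.3+ date_trunc('week', ...) can replace the previous_day helper, since it truncates a timestamp to the Monday of its week.

# Name the aggregate 'Total Count' to match the desired output.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id').alias('Total Count')).orderBy('scheduled_date')

# Alternative (Spark 2.3+): truncate each date to the Monday of its week in one step.
df_week = df.withColumn('week_start', date_trunc('week', col('scheduled_date').cast('timestamp')))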

Regarding Python PySpark: Count Number of Rows by Week w/Week Starting on Monday and Ending on Sunday, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52724260/
