
python - Pandas optimization: how to process pandas time-series data in two columns and get the hourly data between the columns

Reposted. Author: 行者123. Updated: 2023-12-01 09:20:05

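I am working with employee timecard data in which every row has a punch-in (start) time and a punch-out (end) time. The span between them can be anywhere from 0 minutes to 9 hours. For each row, I want to know how much time the employee worked in each hour of the day. I can do it as follows: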

import pandas as pd 
import numpy as np
pd.options.display.max_columns = 100

store_id = np.repeat(1,10)
employee = [1,2,3,1,2,3,1,2,3,4]
start_time = pd.date_range('2015-07-03', periods=10, freq='25T')
end_time = pd.date_range('2015-07-03', periods = 10,freq = '40T')
df = pd.DataFrame({'store_id':store_id,'employee':employee,'start_time':start_time,'end_time':end_time})
df.head()
employee end_time start_time store_id
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1

df['date']=df['start_time'].dt.date

def shift_time_in_hr(row):
    # hrs: dictionary mapping each hour of the day (0-23) to the time worked in that hour
    hrs = dict(zip(np.arange(24), [pd.Timedelta(0)] * 24))
    # Series.append was removed in pandas 2.0, so the hour columns are attached with pd.concat
    # Case 1: start time and end time fall in the same hour, so assign the whole span to that hour
    if row['start_time'].hour == row['end_time'].hour:
        hrs[row['start_time'].hour] = row['end_time'] - row['start_time']
        return pd.concat([row, pd.Series(list(hrs.values()), index=['{}_hr'.format(i) for i in list(hrs.keys())])])
    else:
        hrs_worked = np.arange(row['start_time'].hour, row['end_time'].hour + 1)
        # Case 2: start_time and end_time are in different hours; any full hours in between get 60 minutes
        if len(hrs_worked) > 2:
            for i in range(hrs_worked[0] + 1, hrs_worked[-1]):
                hrs[i] = pd.Timedelta('1 Hour')
        # Assign the start_time and end_time minutes to their respective hours
        hrs[hrs_worked[0]] = pd.Timedelta('{} Minutes'.format(60 - row['start_time'].minute))
        #hrs[hrs_worked[0]] = 60-row['start_time'].minute
        hrs[hrs_worked[-1]] = pd.Timedelta('{} Minutes'.format(row['end_time'].minute))
        return pd.concat([row, pd.Series(list(hrs.values()), index=['{}_hr'.format(i) for i in list(hrs.keys())])])

df=df.apply(shift_time_in_hr,axis = 1)

df.head()
employee end_time start_time store_id date 0_hr 1_hr 2_hr 3_hr 4_hr 5_hr 6_hr 7_hr 8_hr 9_hr 10_hr 11_hr 12_hr 13_hr 14_hr 15_hr 16_hr 17_hr 18_hr 19_hr 20_hr 21_hr 22_hr 23_hr
0 1 2015-07-03 00:00:00 2015-07-03 00:00:00 1 2015-07-03 00:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
1 2 2015-07-03 00:40:00 2015-07-03 00:25:00 1 2015-07-03 00:15:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 3 2015-07-03 01:20:00 2015-07-03 00:50:00 1 2015-07-03 00:10:00 00:20:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 1 2015-07-03 02:00:00 2015-07-03 01:15:00 1 2015-07-03 00:00:00 00:45:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2 2015-07-03 02:40:00 2015-07-03 01:40:00 1 2015-07-03 00:00:00 00:20:00 00:40:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days

dict_agg= dict(zip(df.columns[5:],[np.sum]*24))
df.groupby(['store_id','employee','date']).agg(dict_agg)

Expected output: the minutes each employee worked in each hour of the day.

                                0_hr    1_hr    2_hr    3_hr    4_hr    5_hr    6_hr    7_hr    8_hr    9_hr    10_hr   11_hr   12_hr   13_hr   14_hr   15_hr   16_hr   17_hr   18_hr   19_hr   20_hr   21_hr   22_hr   23_hr
store_id employee date
1 1 2015-07-03 00:00:00 00:45:00 00:30:00 01:00:00 00:00:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
2 2015-07-03 00:15:00 00:20:00 00:45:00 01:00:00 00:40:00 00:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
3 2015-07-03 00:10:00 00:20:00 00:55:00 01:00:00 01:00:00 00:20:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days
4 2015-07-03 00:00:00 00:00:00 00:00:00 00:15:00 01:00:00 01:00:00 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days 0 days

Is there a better way to do this, something more Pythonic or more idiomatic in pandas than my naive approach?

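Best answer

This is not a complete answer, but a building block you can use.

Let's calculate the minutes worked for a given start and end timestamp, hopefully in a more pandas-friendly way: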

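import pandas as pd

def diff(ts):
    # Clip negative offsets to zero, then take the hour-to-hour difference
    ts[ts < pd.Timedelta(0)] = pd.Timedelta(0)
    # Fill the leading NaT with a zero Timedelta (an integer 0 does not keep the timedelta dtype in recent pandas)
    return (ts - ts.shift(1)).fillna(pd.Timedelta(0))

def calculate_time_worked(start, end):
    # Hourly grid from midnight of the start day to midnight after the end day
    # ('h' replaces the 'H' alias, which is deprecated in recent pandas)
    _range = pd.date_range(start=start.date(),
                           end=end.date()+pd.Timedelta('1D'),
                           freq='h')
    base = pd.Series(_range)
    time_worked = diff(base-start) - diff(base-end)
    time_worked = time_worked.apply(lambda x: x.total_seconds() / 60)
    # Each value is labelled with the END of the hour it belongs to
    time_worked.index = base
    return time_worked[time_worked > 0]


start = pd.Timestamp('2017-06-13 20:11')
end = pd.Timestamp('2017-06-13 22:35')
time_worked = calculate_time_worked(start, end)
assert time_worked.to_dict() == {
    pd.Timestamp('2017-06-13 21:00'): 49.0,
    pd.Timestamp('2017-06-13 22:00'): 60.0,
    pd.Timestamp('2017-06-13 23:00'): 35.0}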
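
As a cross-check on the numbers in the assertion above, the same per-hour minutes can be obtained by treating each shift as an interval and measuring its overlap with every clock hour. This is a different technique from the diff-based function, offered only as a sketch: the name `hourly_minutes` is hypothetical, the columns are labelled by the start of each hour (`20_hr` covers 20:00 to 21:00) rather than by its end, and it assumes every shift starts and ends on the same calendar day.

```python
import numpy as np
import pandas as pd


def hourly_minutes(df):
    """Minutes of each [start_time, end_time) shift that fall in each clock hour.

    Hypothetical helper; assumes a shift does not cross midnight.
    """
    day = df['start_time'].dt.normalize()
    # Minutes since midnight, shaped (n_rows, 1) so they broadcast against the 24 hours
    start_min = ((df['start_time'] - day) / pd.Timedelta(minutes=1)).to_numpy()[:, None]
    end_min = ((df['end_time'] - day) / pd.Timedelta(minutes=1)).to_numpy()[:, None]
    hour_start = np.arange(24) * 60  # minute at which each clock hour begins
    # Overlap of [start, end) with [hour_start, hour_start + 60), floored at zero
    overlap = np.minimum(end_min, hour_start + 60) - np.maximum(start_min, hour_start)
    overlap = np.clip(overlap, 0, None)
    return pd.DataFrame(overlap, index=df.index,
                        columns=['{}_hr'.format(h) for h in range(24)])


shifts = pd.DataFrame({
    'start_time': pd.to_datetime(['2017-06-13 20:11', '2015-07-03 00:50']),
    'end_time': pd.to_datetime(['2017-06-13 22:35', '2015-07-03 01:20']),
})
per_hour = hourly_minutes(shifts)
print(per_hour.loc[0, ['20_hr', '21_hr', '22_hr']].tolist())
```

Because the whole frame is processed with one broadcasted NumPy expression, no row-wise `apply` is needed; the result can be joined back to the original columns and grouped by store, employee, and date.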

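You can use calculate_time_worked in several ways. For example, generate tuples or dictionaries of (timestamp, time_worked, id, store), build a flat data frame of worked time periods, and then reshape it into the desired format with a join operation. Feel free to build on this code; I hope it is useful.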
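
One possible way to follow that suggestion is sketched below. It is not part of the original answer: the sample `shifts` frame, the `elapsed_per_hour` helper, and the use of `pivot_table` for the reshape step are all assumptions made for illustration, and `calculate_time_worked` is restated (with the same logic) so that the block runs on its own.

```python
import pandas as pd


def calculate_time_worked(start, end):
    """Minutes worked per hour, labelled by the END of each hour (as in the answer)."""
    hour_ends = pd.Series(pd.date_range(start=start.normalize(),
                                        end=end.normalize() + pd.Timedelta(days=1),
                                        freq=pd.Timedelta(hours=1)))

    def elapsed_per_hour(offsets):
        # Clip negative offsets to zero, then difference from one hour to the next
        clipped = offsets.where(offsets > pd.Timedelta(0), pd.Timedelta(0))
        return (clipped - clipped.shift(1)).fillna(pd.Timedelta(0))

    worked = elapsed_per_hour(hour_ends - start) - elapsed_per_hour(hour_ends - end)
    minutes = worked.dt.total_seconds() / 60
    minutes.index = hour_ends
    return minutes[minutes > 0]


# Hypothetical sample data in the same layout as the question
shifts = pd.DataFrame({
    'store_id': [1, 1, 1],
    'employee': [1, 2, 1],
    'start_time': pd.to_datetime(['2015-07-03 00:50', '2015-07-03 01:40', '2015-07-03 03:10']),
    'end_time': pd.to_datetime(['2015-07-03 01:20', '2015-07-03 02:40', '2015-07-03 03:30']),
})

# Flat (long) frame: one record per shift and per hour worked
records = []
for row in shifts.itertuples(index=False):
    for hour_end, minutes in calculate_time_worked(row.start_time, row.end_time).items():
        hour_start = hour_end - pd.Timedelta(hours=1)  # relabel by the start of the hour
        records.append({'store_id': row.store_id, 'employee': row.employee,
                        'date': hour_start.date(), 'hour': hour_start.hour,
                        'minutes': minutes})
flat = pd.DataFrame(records)

# Reshape to one row per (store, employee, date) and one column per hour
wide = flat.pivot_table(index=['store_id', 'employee', 'date'], columns='hour',
                        values='minutes', aggfunc='sum', fill_value=0)
wide = wide.reindex(columns=range(24), fill_value=0)
wide.columns = ['{}_hr'.format(h) for h in wide.columns]
print(wide[['0_hr', '1_hr', '2_hr', '3_hr']])
```

Relabelling each value by the start of its hour keeps a shift that ends after 23:00 on the correct date, since the function labels that last hour with midnight of the next day.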

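Regarding "python - Pandas optimization: how to process pandas time-series data in two columns and get the hourly data between the columns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50863229/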
