
Pandas - Expanding average session time

Reposted. Author: 行者123. Updated: 2023-12-03 18:57:27

The following DataFrame represents events received from users, with a user ID and an event timestamp:

    id           timestamp
0    1 2020-09-01 18:14:35
1    1 2020-09-01 18:14:39
2    1 2020-09-01 18:14:40
3    1 2020-09-01 02:09:22
4    1 2020-09-01 02:09:35
5    1 2020-09-01 02:09:53
6    1 2020-09-01 02:09:57
7    2 2020-09-01 18:14:35
8    2 2020-09-01 18:14:39
9    2 2020-09-01 18:14:40
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57

I want to get the expanding average session time. A session is defined as a sequence of events that is terminated by a break of more than 5 minutes.
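To make the session definition concrete, here is a minimal sketch that labels sessions by the gap to the previous event of the same user. The sample rows, the `gap` variable and the `session` column are made up for illustration, and the rows are assumed to be in chronological order within each user.

```python
import pandas as pd

# Hypothetical sample, sorted by time within each user.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2020-09-01 02:09:22", "2020-09-01 02:09:35",
        "2020-09-01 18:14:35", "2020-09-01 18:14:39",
        "2020-09-01 02:09:22", "2020-09-01 02:20:00",
    ]),
})

# Time since the previous event of the same user (NaT for a user's first event).
gap = df.groupby("id")["timestamp"].diff()

# A new session starts at a user's first event or after a break of more than 5 minutes.
new_session = gap.isna() | (gap > pd.Timedelta(minutes=5))

# Running count of session starts per user gives a 1-based session number.
df["session"] = new_session.astype(int).groupby(df["id"]).cumsum()
print(df)
```

Unlike fixed 5-minute bins, this rule depends only on the distance between consecutive events, so a long run of closely spaced events stays in one session.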

I group the sessions as follows:

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])

This gives the correct groups:

   id           timestamp
3   1 2020-09-01 02:09:22
4   1 2020-09-01 02:09:35
5   1 2020-09-01 02:09:53
6   1 2020-09-01 02:09:57

   id           timestamp
0   1 2020-09-01 18:14:35
1   1 2020-09-01 18:14:39
2   1 2020-09-01 18:14:40

    id           timestamp
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57

   id           timestamp
7   2 2020-09-01 18:14:35
8   2 2020-09-01 18:14:39
9   2 2020-09-01 18:14:40

Now I want to compute, for each user, the average session time (in seconds) as of any given row, so the output would be:

    id           timestamp  avg_session_time
0    1 2020-09-01 18:14:35                 0  <-- first event
1    1 2020-09-01 18:14:39                 4  <-- 2nd event, after 4 seconds
2    1 2020-09-01 18:14:40                 5  <-- 3rd event, after 5 seconds
--- session end
3    1 2020-09-01 02:09:22                 5  <-- first event of the second session
4    1 2020-09-01 02:09:35                 9  <-- 2nd event, after 13 seconds (13 seconds in the 2nd session + 5 in the first, divided by the number of sessions: (13 + 5) / 2 = 9)
5    1 2020-09-01 02:09:53                18  <-- 3rd event, after 31 seconds ((31 + 5) / 2 = 18)
6    1 2020-09-01 02:09:57                20  <-- 4th event, after 35 seconds ((35 + 5) / 2 = 20)
---
7    2 2020-09-01 18:14:35                 0
8    2 2020-09-01 18:14:39                 4
9    2 2020-09-01 18:14:40                 5
---
10   2 2020-09-01 02:09:22                 5
11   2 2020-09-01 02:09:35                 9
12   2 2020-09-01 02:09:53                18
13   2 2020-09-01 02:09:57                20

Any help would be great :)

Best answer

Use:

import numpy as np
import pandas as pd

# convert the timestamp column to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])

# group by id and by 5-minute bins
g = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
# first timestamp of each group, broadcast to a new column
df['diff'] = g['timestamp'].transform('first')
# subtract it from each timestamp and convert the timedeltas to seconds
df['diff'] = df['timestamp'].sub(df['diff']).dt.total_seconds()
# shift the elapsed seconds by one row within each id
df['new'] = df.groupby('id')['diff'].shift()
# first shifted value of each group (for a later session, the last value of the previous one)
df['new'] = g['new'].transform('first')
# replace 0 with missing values, average the two columns row-wise, fall back to 'new'
df['last'] = df[['new','diff']].replace(0, np.nan).mean(axis=1).fillna(df['new'])

print(df)
    id           timestamp  diff  new  last
0    1 2020-09-01 18:14:35   0.0  0.0   0.0
1    1 2020-09-01 18:14:39   4.0  0.0   4.0
2    1 2020-09-01 18:14:40   5.0  0.0   5.0
3    1 2020-09-01 02:09:22   0.0  5.0   5.0
4    1 2020-09-01 02:09:35  13.0  5.0   9.0
5    1 2020-09-01 02:09:53  31.0  5.0  18.0
6    1 2020-09-01 02:09:57  35.0  5.0  20.0
7    2 2020-09-01 18:14:35   0.0  0.0   0.0
8    2 2020-09-01 18:14:39   4.0  0.0   4.0
9    2 2020-09-01 18:14:40   5.0  0.0   5.0
10   2 2020-09-01 02:09:22   0.0  5.0   5.0
11   2 2020-09-01 02:09:35  13.0  5.0   9.0
12   2 2020-09-01 02:09:53  31.0  5.0  18.0
13   2 2020-09-01 02:09:57  35.0  5.0  20.0
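The snippet above averages the current session with only the one before it, so it matches the expected output for two sessions per user. As a hedged sketch (not part of the original answer), the same rule might be extended to any number of sessions as follows. It assumes that sessions are split on the absolute gap between consecutive rows, because the sample rows are not in chronological order, and that a session whose elapsed time is still 0 is not yet counted, which is how the expected output treats row 3. The column names `session`, `elapsed` and `avg_session_time` are illustrative.

```python
import pandas as pd

times = ["2020-09-01 18:14:35", "2020-09-01 18:14:39", "2020-09-01 18:14:40",
         "2020-09-01 02:09:22", "2020-09-01 02:09:35", "2020-09-01 02:09:53",
         "2020-09-01 02:09:57"]
df = pd.DataFrame({"id": [1] * 7 + [2] * 7,
                   "timestamp": pd.to_datetime(times * 2)})

# Assumption: a new session starts when the absolute gap to the previous
# row of the same user exceeds 5 minutes (rows are taken in the given order).
gap = df.groupby("id")["timestamp"].diff().abs()
df["session"] = (gap > pd.Timedelta(minutes=5)).astype(int).groupby(df["id"]).cumsum()

# Seconds elapsed since the first event of the row's session.
keys = ["id", "session"]
start = df.groupby(keys)["timestamp"].transform("first")
df["elapsed"] = (df["timestamp"] - start).dt.total_seconds()

# Final duration of each session, placed on its last row and 0 elsewhere.
is_last = ~df.duplicated(subset=keys, keep="last")
closed = df["elapsed"].where(is_last, 0.0)

# Sum of the durations of all sessions that ended before the current row's session.
prev_total = closed.groupby(df["id"]).cumsum() - closed

# Earlier sessions, plus the current one once it has a non-zero duration.
count = df["session"] + (df["elapsed"] > 0).astype(int)

# 0 / 0 on a user's very first event yields NaN, reported here as 0.
df["avg_session_time"] = ((prev_total + df["elapsed"]) / count).fillna(0)
print(df[["id", "timestamp", "avg_session_time"]])
```

On the sample data this reproduces the `avg_session_time` column requested in the question. With a third session, the running total would include both earlier durations instead of only the most recent one.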

This question, Pandas - Expanding average session time, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/65561016/
