
pandas - Vectorizing a variable-length look-ahead loop in pandas

Reposted. Author: 行者123. Updated: 2023-12-02 02:45:11

Here is a very simplified version of my data:

+----+---------+---------------------+
|    | user_id | seconds_since_start |
+----+---------+---------------------+
|  0 |       1 |                  10 |
|  1 |       1 |                  12 |
|  2 |       1 |                  15 |
|  3 |       1 |                  52 |
|  4 |       1 |                  60 |
|  5 |       1 |                  67 |
|  6 |       1 |                 120 |
|  7 |       2 |                  55 |
|  8 |       2 |                  62 |
|  9 |       2 |                 105 |
| 10 |       3 |                 200 |
| 11 |       3 |                 206 |
+----+---------+---------------------+

Here is the data I want to produce:

+----+---------+---------------------+-----------------+------------------+
|    | user_id | seconds_since_start | session_ordinal | session_duration |
+----+---------+---------------------+-----------------+------------------+
|  0 |       1 |                  10 |               1 |                5 |
|  1 |       1 |                  12 |               1 |                5 |
|  2 |       1 |                  15 |               1 |                5 |
|  3 |       1 |                  52 |               2 |               15 |
|  4 |       1 |                  60 |               2 |               15 |
|  5 |       1 |                  67 |               2 |               15 |
|  6 |       1 |                 120 |               3 |                0 |
|  7 |       2 |                  55 |               1 |                7 |
|  8 |       2 |                  62 |               1 |                7 |
|  9 |       2 |                 105 |               2 |                0 |
| 10 |       3 |                 200 |               1 |                6 |
| 11 |       3 |                 206 |               1 |                6 |
+----+---------+---------------------+-----------------+------------------+

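My notion of a session is a group of events from a single user in which each event occurs no more than 10 seconds after the previous one. The duration of a session is the difference, in seconds, between its first and last events.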
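To make the definition concrete, here is a minimal plain-Python sketch that applies it to user 1's timestamps from the table above. The names `times`, `MAX_GAP`, `sessions`, and `durations` are illustrative only and do not come from the question.

```python
# User 1's timestamps, copied from the sample data.
times = [10, 12, 15, 52, 60, 67, 120]
MAX_GAP = 10  # seconds; a larger gap starts a new session

# Start with one session holding the first event, then walk consecutive pairs.
sessions = [[times[0]]]
for prev, cur in zip(times, times[1:]):
    if cur - prev > MAX_GAP:
        sessions.append([cur])    # gap too large: open a new session
    else:
        sessions[-1].append(cur)  # still within the current session

# Duration is last event minus first event in each session.
durations = [s[-1] - s[0] for s in sessions]

print(sessions)   # [[10, 12, 15], [52, 60, 67], [120]]
print(durations)  # [5, 15, 0]
```

The gaps between consecutive events are 2, 3, 37, 8, 7, and 53 seconds, so new sessions begin at 52 and at 120, which matches the three sessions in the desired output.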

I have written Python that does what I want:

import pandas as pd

events_data = [[1, 10], [1, 12], [1, 15], [1, 52], [1, 60], [1, 67], [1, 120],
               [2, 55], [2, 62], [2, 105],
               [3, 200], [3, 206]]
events = pd.DataFrame(data=events_data, columns=['user_id', 'seconds_since_start'])

def record_session(index_range, ordinal, duration):
    # Write the session's ordinal and duration onto every row in the session.
    for i in index_range:
        events.at[i, 'session_ordinal'] = ordinal
        events.at[i, 'session_duration'] = duration

session_indexes = []
current_user = previous_time = session_start = -1
session_num = 0
for i, row in events.iterrows():
    # A new session starts on a change of user or a gap of more than 10 seconds.
    if row['user_id'] != current_user or (row['seconds_since_start'] - previous_time) > 10:
        record_session(session_indexes, session_num, previous_time - session_start)
        session_indexes = [i]
        session_num += 1
        session_start = row['seconds_since_start']
    if row['user_id'] != current_user:
        current_user = row['user_id']
        session_num = 1
    previous_time = row['seconds_since_start']
    session_indexes.append(i)
# Flush the final session after the loop ends.
record_session(session_indexes, session_num, previous_time - session_start)

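My problem is that this takes too long to run. As I said, this is a very simplified version of my data; my actual data has 70,000,000 rows. Is there a way to vectorize (and thereby speed up) an algorithm like this one, which builds extra columns from a variable-length look-ahead?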

Best answer

You could try:

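# df is the events DataFrame from the question.
# Create a helper boolean Series: True where the gap to the same user's
# previous event is greater than 10 seconds
s = df.groupby('user_id')['seconds_since_start'].diff().gt(10)

# Running count of session breaks per user, shifted so ordinals start at 1
df['session_ordinal'] = s.groupby(df['user_id']).cumsum().add(1).astype(int)

# Last timestamp minus first timestamp within each (user, session) group
df['session_duration'] = (df.groupby(['user_id', 'session_ordinal'])['seconds_since_start']
                          .transform(lambda x: x.max() - x.min()))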

[Output]

    user_id  seconds_since_start  session_ordinal  session_duration
0         1                   10                1                 5
1         1                   12                1                 5
2         1                   15                1                 5
3         1                   52                2                15
4         1                   60                2                15
5         1                   67                2                15
6         1                  120                3                 0
7         2                   55                1                 7
8         2                   62                1                 7
9         2                  105                2                 0
10        3                  200                1                 6
11        3                  206                1                 6
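For reference, the same idea can be written as a self-contained script with each intermediate step named. This is a sketch, not the answer's exact code. It assumes the rows are already sorted by user and then by time, as in the sample data. The names `gap`, `starts`, and `sessions` are illustrative. It also swaps the lambda for the built-in `'max'` and `'min'` aggregations, which is likely to matter at 70,000,000 rows because a lambda passed to `transform` is called once per group in Python.

```python
import pandas as pd

events = pd.DataFrame(
    [[1, 10], [1, 12], [1, 15], [1, 52], [1, 60], [1, 67], [1, 120],
     [2, 55], [2, 62], [2, 105],
     [3, 200], [3, 206]],
    columns=["user_id", "seconds_since_start"],
)

# Assumption: rows are sorted by user_id, then by seconds_since_start.

# Step 1: gap to the same user's previous event (NaN for a user's first event).
gap = events.groupby("user_id")["seconds_since_start"].diff()

# Step 2: 1 where a gap of more than 10 seconds opens a new session, else 0.
# NaN compares as False, so a user's first event is never counted as a break.
starts = gap.gt(10).astype(int)

# Step 3: running count of breaks per user, shifted so ordinals start at 1.
events["session_ordinal"] = starts.groupby(events["user_id"]).cumsum().add(1)

# Step 4: last minus first timestamp within each (user, session) group,
# using built-in aggregations instead of a Python lambda.
sessions = events.groupby(["user_id", "session_ordinal"])["seconds_since_start"]
events["session_duration"] = sessions.transform("max") - sessions.transform("min")

print(events)
```

On the sample data this yields the same two columns as the output above. Whether it is fast enough on the full dataset would need to be measured.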

Regarding "pandas - Vectorizing a variable-length look-ahead loop in pandas", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55646355/
