
python - Pandas DataFrame: Groupby Column, Sort by Datetime, and Truncate Groups by Condition

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:14:46

I have a Pandas DataFrame that looks similar to this:

import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ],
                  columns=['id', 'time', 'equipment'])


    id                 time           equipment
 0   a  2018-09-30 00:03:00     that is a glove
 1   b  2018-09-30 00:04:00     this is a glove
 2   b  2018-09-30 00:09:00        she has ball
 3   a  2018-09-30 00:05:00    they have a ball
 4   a  2018-09-30 00:01:00      she has a shoe
 5   c  2018-09-30 00:04:00   I have a baseball
 6   a  2018-09-30 00:02:00       this is a hat
 7   a  2018-09-30 00:06:00    he has no helmet
 8   b  2018-09-30 00:11:00      he has no shoe
 9   c  2018-09-30 00:02:00       we have a hat
10   a  2018-09-30 00:04:00  we have a baseball
11   c  2018-09-30 00:06:00  they have no glove

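What I want to do is group by id, sort by time within each group, and then return every row up to and including the row that contains the word "ball". So far, I can group and sort: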

df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)


    id                 time           equipment
 0   a  2018-09-30 00:01:00      she has a shoe
 1   a  2018-09-30 00:02:00       this is a hat
 2   a  2018-09-30 00:03:00     that is a glove
 3   a  2018-09-30 00:04:00  we have a baseball
 4   a  2018-09-30 00:05:00    they have a ball
 5   a  2018-09-30 00:06:00    he has no helmet
 6   b  2018-09-30 00:04:00     this is a glove
 7   b  2018-09-30 00:09:00        she has ball
 8   b  2018-09-30 00:11:00      he has no shoe
 9   c  2018-09-30 00:02:00       we have a hat
10   c  2018-09-30 00:04:00   I have a baseball
11   c  2018-09-30 00:06:00  they have no glove

However, I would like the output to look like this:

   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball

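Note that group c returns no rows, because it has no row containing the word "ball". Group c does contain the word "baseball", but that is not the match we are looking for. Likewise, note that group a does not stop at the "baseball" row, because we stop at the "ball" row. What is the most efficient way to achieve this, in terms of both speed and memory?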
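The "ball" versus "baseball" distinction comes down to whole-word matching. As a minimal sketch, assuming a regular expression with word boundaries (`\b`) is an acceptable way to express "the word ball", the two kinds of match can be compared on a few phrases from the example data (the `phrases` Series is introduced here only for illustration):

```python
import pandas as pd

# Illustrative phrases taken from the example data above
phrases = pd.Series(['we have a baseball', 'they have a ball', 'she has ball'])

# Plain substring match: also hits "baseball"
substring_match = phrases.str.contains('ball', regex=False)

# Whole-word match: \b marks a word boundary, so "baseball" is excluded
word_match = phrases.str.contains(r'\bball\b')

print(substring_match.tolist())  # [True, True, True]
print(word_match.tolist())       # [False, True, True]
```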

Best answer

Here is my approach:

# the expected output is sorted by id and time,
# so sort the whole frame that way first
df = df.sort_values(['id', 'time'])

# mark the rows that contain the whole word `ball` (not `baseball`)
has_ball = df['equipment'].str.contains(r'\bball\b')

# running count of `ball` rows within each group
s = has_ball.groupby(df['id']).cumsum()

# a group is valid only if it has at least one `ball` row
valid_groups = has_ball.groupby(df['id']).transform('max')

print(df[valid_groups &
         (s.eq(0) |               # rows before the first `ball` row
          (s.eq(1) & has_ball))   # the first `ball` row itself
        ])

Output:

    id                 time           equipment
 4   a  2018-09-30 00:01:00      she has a shoe
 6   a  2018-09-30 00:02:00       this is a hat
 0   a  2018-09-30 00:03:00     that is a glove
10   a  2018-09-30 00:04:00  we have a baseball
 3   a  2018-09-30 00:05:00    they have a ball
 1   b  2018-09-30 00:04:00     this is a glove
 2   b  2018-09-30 00:09:00        she has ball
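The same idea can be wrapped in a reusable function. The sketch below is one possible packaging, not part of the original answer: the name `truncate_at_first_match` and its parameters are hypothetical, and it assumes the time column should be parsed with `pd.to_datetime` so that the sort is chronological rather than lexicographic. Instead of checking two cumulative-sum cases, it counts the matches seen strictly before each row and keeps the rows where that count is zero.

```python
import pandas as pd


def truncate_at_first_match(df, group_col, time_col, text_col, pattern):
    """Keep each group's rows up to and including its first match.

    Hypothetical helper: groups with no matching row are dropped entirely.
    """
    out = df.copy()
    # Assumption: parse timestamps so the sort is chronological
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values([group_col, time_col])

    hit = out[text_col].str.contains(pattern)
    grouped = hit.groupby(out[group_col])

    # Matches seen strictly before the current row, within each group
    hits_before = grouped.cumsum() - hit.astype(int)
    # True for every row of a group that has at least one match
    group_has_hit = grouped.transform('any')

    return out[group_has_hit & hits_before.eq(0)]


# Small made-up frame: group a has a "ball" row, group c only "baseball"
small = pd.DataFrame(
    [['a', '2018-09-30 00:03:00', 'we have a baseball'],
     ['a', '2018-09-30 00:01:00', 'she has a shoe'],
     ['a', '2018-09-30 00:04:00', 'they have a ball'],
     ['a', '2018-09-30 00:05:00', 'he has no helmet'],
     ['c', '2018-09-30 00:02:00', 'I have a baseball']],
    columns=['id', 'time', 'equipment'])

result = truncate_at_first_match(small, 'id', 'time', 'equipment', r'\bball\b')
print(result)
```

Like the answer above, this stays with vectorized groupby operations and avoids `groupby(...).apply` with a Python lambda, which tends to be slower on large frames. The input frame is copied rather than modified, which costs some memory; dropping the `copy()` is an option if mutating the caller's frame is acceptable.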

Regarding python - Pandas DataFrame: Groupby Column, Sort by Datetime, and Truncate Groups by Condition, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58525547/
