
python - Pandas: select a subset of the data by a complex condition within each group

Reposted. Author: 行者123. Updated: 2023-12-04 15:12:01

I need to select a subset from a given DataFrame. Here is df:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'custom_id': ['aa', 'aa', 'aa', 'aa', 'aa', 'aa',
                  'bk', 'bk', 'bk', 'bk', 'bk',
                  'dd', 'dd', 'dd', 'dd', 'dd',
                  'ff', 'ff', 'ff', 'ff', 'ff', 'ff',
                  'pu', 'pu', 'pu', 'pu'],
    'sending_num': [11, 252, 198, 266, 5317, 'from',
                    67, 287, 909, 881, 'from',
                    22, 55, 'from', 376, 98,
                    901, 126, 22, 381, 867, 'from',
                    421, 81, 326, 'from'],
    'receiving_num': [900, 11, 252, 198, 266, 5317,
                      345, 67, 287, 909, 881,
                      432, 22, 55, 65, 376,
                      42, 901, 126, 22, 381, 867,
                      66, 421, 81, 326],
    'note': [np.nan, 'flag', np.nan, np.nan, 'flag', np.nan,
             'flag', np.nan, np.nan, np.nan, np.nan,
             np.nan, 'flag', np.nan, np.nan, np.nan,
             np.nan, np.nan, np.nan, np.nan, 'flag', np.nan,
             np.nan, np.nan, np.nan, np.nan]
})

And df looks like this:

   custom_id sending_num  receiving_num  note
0         aa          11            900   NaN
1         aa         252             11  flag
2         aa         198            252   NaN
3         aa         266            198   NaN
4         aa        5317            266  flag
5         aa        from           5317   NaN
6         bk          67            345  flag
7         bk         287             67   NaN
8         bk         909            287   NaN
9         bk         881            909   NaN
10        bk        from            881   NaN
11        dd          22            432   NaN
12        dd          55             22  flag
13        dd        from             55   NaN
14        dd         376             65   NaN
15        dd          98            376   NaN
16        ff         901             42   NaN
17        ff         126            901   NaN
18        ff          22            126   NaN
19        ff         381             22   NaN
20        ff         867            381  flag
21        ff        from            867   NaN
22        pu         421             66   NaN
23        pu          81            421   NaN
24        pu         326             81   NaN
25        pu        from            326   NaN

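I want to select a subset according to the following rule: a group (custom_id) is selected if it contains a row with 'from' in the 'sending_num' column and the row directly above it, within the same group, has 'flag' in the 'note' column. For example, group 'aa' has 'from' in its 'sending_num' column (row 5), and the row above it (row 4) in the same group has 'flag' in the 'note' column, so 'aa' is a target. The same holds for groups 'dd' and 'ff': each has 'from' in 'sending_num' and 'flag' in 'note' on the row above, so these two are selected and the other groups are not. I tried writing a loop with iloc to do this, but it is slow. In the end, I would like a subset like this, following the rule: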

   custom_id sending_num  receiving_num  note
0         aa          11            900   NaN
1         aa         252             11  flag
2         aa         198            252   NaN
3         aa         266            198   NaN
4         aa        5317            266  flag  # 'flag' row &
5         aa        from           5317   NaN  # 'from' row are adjacent for 'aa'
6         dd          22            432   NaN
7         dd          55             22  flag  # 'flag' row &
8         dd        from             55   NaN  # 'from' row are adjacent for 'dd'
9         dd         376             65   NaN
10        dd          98            376   NaN
11        ff         901             42   NaN
12        ff         126            901   NaN
13        ff          22            126   NaN
14        ff         381             22   NaN
15        ff         867            381  flag  # 'flag' row &
16        ff        from            867   NaN  # 'from' row are adjacent for 'ff'

I would be very grateful if anyone could help.

Best answer

Let's group the DataFrame by custom_id and filter the groups with a custom lambda function f that returns a boolean based on the specified condition:

f = lambda g: (g['sending_num'].eq('from') & g['note'].shift().eq('flag')).any()
sub_df = df.groupby('custom_id').filter(f)
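
To make it easier to see what the lambda evaluates for each group, here is a minimal sketch on a small hypothetical frame (`demo`, the helper name `has_flag_above_from`, and the groups 'x' and 'y' are illustrative assumptions, not part of the question's data). Group 'x' has a 'flag' directly above its 'from' row; group 'y' has a 'flag', but not on the row above 'from'.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only (not the question's df).
demo = pd.DataFrame({
    'custom_id':   ['x', 'x', 'x', 'y', 'y', 'y'],
    'sending_num': [1, 2, 'from', 3, 4, 'from'],
    'note':        [np.nan, 'flag', np.nan, 'flag', np.nan, np.nan],
})

def has_flag_above_from(g):
    # Rows whose sending_num is the string 'from'
    is_from = g['sending_num'].eq('from')
    # Rows whose previous row (inside this group) has note == 'flag'
    flag_above = g['note'].shift().eq('flag')
    # The group qualifies if any single row satisfies both conditions
    return bool((is_from & flag_above).any())

# One boolean per group: this is the value filter() uses to keep or drop a group
per_group = {key: has_flag_above_from(g) for key, g in demo.groupby('custom_id')}
print(per_group)

# filter() keeps every row of the groups for which the function returned True
kept = demo.groupby('custom_id').filter(has_flag_above_from)
print(kept)
```

Here only group 'x' survives, and it keeps all of its rows, not just the 'flag' and 'from' rows.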

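Alternatively, you can first build a boolean mask from the specified condition, then use this mask to get the custom_id values that satisfy the rule: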

m = df.groupby('custom_id')['note'].shift().eq('flag') & df['sending_num'].eq('from')
sub_df = df[df['custom_id'].isin(df.loc[m, 'custom_id'].unique())].copy()
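
One detail worth spelling out is why the mask shifts 'note' per group rather than over the whole column. The sketch below uses a small hypothetical frame (`demo` and the groups 'p' and 'q' are assumptions for illustration) in which the last row of one group carries 'flag' and the first row of the next group is 'from'. A plain shift would pair those two rows even though they belong to different groups.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'flag' ends group 'p', 'from' starts group 'q'.
demo = pd.DataFrame({
    'custom_id':   ['p', 'p', 'q', 'q'],
    'sending_num': [1, 2, 'from', 3],
    'note':        [np.nan, 'flag', np.nan, np.nan],
})

is_from = demo['sending_num'].eq('from')

# Plain shift ignores group boundaries, so the 'flag' from 'p' leaks into 'q'
plain = demo['note'].shift().eq('flag') & is_from

# Grouped shift restarts at each group, so the first row of 'q' sees NaN above it
grouped = demo.groupby('custom_id')['note'].shift().eq('flag') & is_from

print(plain.tolist())
print(grouped.tolist())
```

With the grouped shift no row matches, which is the behaviour the rule asks for. For the question's df, the sub_df built by either approach above prints as follows: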

print(sub_df)

   custom_id sending_num  receiving_num  note
0         aa          11            900   NaN
1         aa         252             11  flag
2         aa         198            252   NaN
3         aa         266            198   NaN
4         aa        5317            266  flag
5         aa        from           5317   NaN
11        dd          22            432   NaN
12        dd          55             22  flag
13        dd        from             55   NaN
14        dd         376             65   NaN
15        dd          98            376   NaN
16        ff         901             42   NaN
17        ff         126            901   NaN
18        ff          22            126   NaN
19        ff         381             22   NaN
20        ff         867            381  flag
21        ff        from            867   NaN
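
As a possible third variant, not taken from the answer itself, the row-level mask can be broadcast back to whole groups with `transform('any')` instead of `isin`. This is a sketch on hypothetical data (`demo`, `hit`, and `keep` are illustrative names), assuming the same rule as above.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only (not the question's df).
demo = pd.DataFrame({
    'custom_id':   ['x', 'x', 'x', 'y', 'y', 'y'],
    'sending_num': [1, 2, 'from', 3, 4, 'from'],
    'note':        [np.nan, 'flag', np.nan, 'flag', np.nan, np.nan],
})

# Row-level mask: 'from' row whose previous row in the same group is flagged
hit = (demo.groupby('custom_id')['note'].shift().eq('flag')
       & demo['sending_num'].eq('from'))

# Spread "any hit in this group" to every row of the group
keep = hit.groupby(demo['custom_id']).transform('any')

sub = demo[keep]
print(sub)
```

This avoids building the intermediate list of qualifying ids, though whether it is faster than `isin` on a given dataset would need to be measured.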

Regarding "python - Pandas: select a subset of the data by a complex condition within each group", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65059391/
