
python - Using cumcount() with duplicates

Reposted. Author: 行者123. Updated: 2023-11-30 22:11:15

I have a df that looks like this:

ID Component IDDate                   EmployeeID CreateUserID
24 1 2017-09-11 00:00:00.000 0907036 Afior
24 2 2017-09-11 00:00:00.000 0907036 Afior
24 3 2017-09-11 00:00:00.000 0907036 Afior
25 1 2017-09-12 00:00:00.000 0907036 Afior
25 3 2017-09-12 00:00:00.000 0907036 Afior
26 8 2017-09-16 00:00:00.000 1013842 JHyde
26 11 2017-09-16 00:00:00.000 1013842 JHyde
26 12 2017-09-16 00:00:00.000 1013842 JHyde
26 23 2017-09-16 00:00:00.000 1013842 JHyde
27 21 2017-09-16 00:00:00.000 0907036 Afior
27 22 2017-09-16 00:00:00.000 0907036 Afior
27 23 2017-09-16 00:00:00.000 0907036 Afior
28 15 2017-10-16 00:00:00.000 1013842 JHyde
28 16 2017-10-16 00:00:00.000 1013842 JHyde
28 19 2017-10-16 00:00:00.000 1013842 JHyde
28 25 2017-10-16 00:00:00.000 1013842 JHyde
28 26 2017-10-16 00:00:00.000 1013842 JHyde

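I am trying to use cumcount to create a variable that holds the order of observations for each ID/EmployeeID combination. I cannot get the count applied at the level I want. I have tried several variations of cumcount(), but none of them gets me quite there, for example: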

df['seq'] = df.groupby(['EmployeeID', 'ID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'ID']).cumcount().add(1)

Ideally, my output would look like this:

ID Component IDDate                   EmployeeID CreateUserID seq
24 1 2017-09-11 00:00:00.000 0907036 Afior 1
24 2 2017-09-11 00:00:00.000 0907036 Afior 1
24 3 2017-09-11 00:00:00.000 0907036 Afior 1
25 1 2017-09-12 00:00:00.000 0907036 Afior 2
25 3 2017-09-12 00:00:00.000 0907036 Afior 2
26 8 2017-09-16 00:00:00.000 1013842 JHyde 1
26 11 2017-09-16 00:00:00.000 1013842 JHyde 1
26 12 2017-09-16 00:00:00.000 1013842 JHyde 1
26 23 2017-09-16 00:00:00.000 1013842 JHyde 1
27 21 2017-09-16 00:00:00.000 0907036 Afior 3
27 22 2017-09-16 00:00:00.000 0907036 Afior 3
27 23 2017-09-16 00:00:00.000 0907036 Afior 3
28 15 2017-10-16 00:00:00.000 1013842 JHyde 2
28 16 2017-10-16 00:00:00.000 1013842 JHyde 2
28 19 2017-10-16 00:00:00.000 1013842 JHyde 2
28 25 2017-10-16 00:00:00.000 1013842 JHyde 2
28 26 2017-10-16 00:00:00.000 1013842 JHyde 2

Is there a way to handle the duplicated rows so that I get this output? Would it be better to make the df wide first and then apply cumcount()?

Best answer

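Here is one way: group by EmployeeID only, check whether ID changes from one row to the next, and return the cumsum of those changes (this is based on your attempts and your desired output).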

df['seq'] = df.groupby('EmployeeID')['ID'].transform(lambda x: x.ne(x.shift()).cumsum())

>>> df
ID Component IDDate EmployeeID CreateUserID seq
0 24 1 2017-09-11 00:00:00.000 907036 Afior 1
1 24 2 2017-09-11 00:00:00.000 907036 Afior 1
2 24 3 2017-09-11 00:00:00.000 907036 Afior 1
3 25 1 2017-09-12 00:00:00.000 907036 Afior 2
4 25 3 2017-09-12 00:00:00.000 907036 Afior 2
5 26 8 2017-09-16 00:00:00.000 1013842 JHyde 1
6 26 11 2017-09-16 00:00:00.000 1013842 JHyde 1
7 26 12 2017-09-16 00:00:00.000 1013842 JHyde 1
8 26 23 2017-09-16 00:00:00.000 1013842 JHyde 1
9 27 21 2017-09-16 00:00:00.000 907036 Afior 3
10 27 22 2017-09-16 00:00:00.000 907036 Afior 3
11 27 23 2017-09-16 00:00:00.000 907036 Afior 3
12 28 15 2017-10-16 00:00:00.000 1013842 JHyde 2
13 28 16 2017-10-16 00:00:00.000 1013842 JHyde 2
14 28 19 2017-10-16 00:00:00.000 1013842 JHyde 2
15 28 25 2017-10-16 00:00:00.000 1013842 JHyde 2
16 28 26 2017-10-16 00:00:00.000 1013842 JHyde 2
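
To make the steps in the answer easier to follow, here is a minimal self-contained sketch. It rebuilds a trimmed copy of the question's data (only the ID and EmployeeID columns, fewer rows per ID) and splits the lambda into a named helper. The helper names run_number and first_seen_rank, the seq_alt column, and the pd.factorize variant are illustrative assumptions and are not part of the original answer.

```python
import pandas as pd

# Trimmed copy of the question's data. EmployeeID is kept as a string
# here so that the leading zero in "0907036" is preserved.
df = pd.DataFrame({
    "ID": [24, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28],
    "EmployeeID": ["0907036"] * 5 + ["1013842"] * 2
                  + ["0907036"] * 2 + ["1013842"] * 2,
})


def run_number(ids: pd.Series) -> pd.Series:
    """The answer's lambda, split into two steps."""
    # True on the first row of every run of equal IDs
    # (the first row compares against NaN, so it is always True).
    changed = ids.ne(ids.shift())
    # The running total of those flags numbers the runs 1, 2, 3, ...
    return changed.cumsum()


def first_seen_rank(ids: pd.Series) -> pd.Series:
    """Hypothetical alternative: rank each distinct ID by first appearance."""
    codes, _ = pd.factorize(ids)  # codes start at 0, in order of appearance
    return pd.Series(codes + 1, index=ids.index)


# Apply each helper separately within every EmployeeID group.
df["seq"] = df.groupby("EmployeeID")["ID"].transform(run_number)
df["seq_alt"] = df.groupby("EmployeeID")["ID"].transform(first_seen_rank)

print(df)
```

On this data both columns agree. They differ only when the same ID reappears for an employee after a different ID: for IDs 24, 25, 24, run_number gives 1, 2, 3 because it counts changes between adjacent rows, while first_seen_rank gives 1, 2, 1. If rows for one ID may not be adjacent, sorting first or using the factorize variant may be the safer choice.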

For "python - Using cumcount() with duplicates", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51483722/
