
python - Groupby count of sequences and duration of sequences

Reposted. Author: 行者123. Updated: 2023-12-01 00:43:12

I have a data frame as shown below.

    ID  Status  Date        Cost  Duration
0   1   F       2017-06-22  500   nan
1   1   M       2017-07-22  100   30.00
2   1   P       2017-10-22  100   92.00
3   1   F       2018-06-22  600   243.00
4   1   M       2018-08-22  150   61.00
5   1   P       2018-10-22  120   61.00
6   1   F       2019-03-22  750   151.00
7   2   M       2017-06-29  200   nan
8   2   F       2017-09-29  600   92.00
9   2   F       2018-01-29  500   122.00
10  2   M       2018-03-29  100   59.00
11  2   P       2018-08-29  100   153.00
12  2   M       2018-10-29  100   61.00
13  2   F       2018-12-29  500   61.00
14  3   M       2017-03-20  300   nan
15  3   F       2018-06-20  700   457.00
16  3   P       2018-08-20  100   61.00
17  3   M       2018-10-20  250   61.00
18  3   F       2018-11-20  100   31.00
19  3   P       2018-12-20  100   30.00
20  3   F       2019-03-20  600   90.00
22  4   M       2017-08-10  800   nan
23  4   F       2018-06-10  100   304.00
24  4   P       2018-08-10  120   61.00
25  4   F       2018-10-10  500   61.00
26  4   M       2019-01-10  200   92.00
27  4   F       2019-06-10  600   151.00

The data frame is already sorted by ID and Date.

Duration = the number of days that ID took to move from its previous status to the current status.
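
As a rough illustration of that definition, Duration could be derived from Date with a per-ID difference. This is only a sketch: the name `sample` and the five rows (copied from the start of ID 1 and ID 2 above) are chosen here for illustration, and the question does not say how its Duration column was actually computed.

```python
import pandas as pd

# Illustrative subset: first three rows of ID 1 and first two rows of ID 2.
sample = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Status": ["F", "M", "P", "M", "F"],
    "Date": pd.to_datetime(
        ["2017-06-22", "2017-07-22", "2017-10-22", "2017-06-29", "2017-09-29"]
    ),
})

# Sort by ID and Date, then take the gap in days to the previous row of the same ID.
# The first row of each ID has no predecessor, so its Duration is NaN.
sample = sample.sort_values(["ID", "Date"]).reset_index(drop=True)
sample["Duration"] = sample.groupby("ID")["Date"].diff().dt.days
print(sample)
```

On these rows the result agrees with the table above: 30 and 92 days for ID 1, 92 days for ID 2, and NaN on the first row of each ID.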

I would like to prepare the data frame below using pandas functions.

ID  Nof_F-F  Nof_F-M  Nof_F-P  Nof_M-F  Nof_M-M  Nof_M-P  Nof_P-F  Nof_P-M
1   0        2        0        0        0        2        2        0
2   1        1        0        2        0        1        0        1
3   0        0        2        2        0        0        1        1
4   0        1        1        2        0        0        1        0

where Nof_F-F = the number of consecutive F-F transitions, and so on.

In addition, I would like to prepare the following data frame.

ID  Avg_F-F  Avg_F-M  Avg_F-P  Avg_M-F  Avg_M-M  Avg_M-P  Avg_P-F  Avg_P-M
1   nan      45.5     nan      nan      nan      76.5     197      nan
2   122      59       nan      76.5     nan      153      nan      61
3   nan      nan      45.5     244      nan      nan      90       61
4   nan      92       61       227.5    nan      nan      61       nan

Avg_F-F = the average number of days for consecutive F-F transitions; all other columns are defined the same way.

Best answer

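Since the data is already sorted, use shift + where to create the links: shift gives the previous row's Status, and where blanks it out on the first row of each ID so that no link crosses from one ID to the next. Then use groupby + unstack.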

# Previous status (NaN where the previous row belongs to a different ID) + '-' + current status
df['link'] = df.Status.shift().where(df.ID.eq(df.ID.shift())) + '-' + df.Status

# fill_value=0 keeps the counts as integers instead of floats
df.groupby(['ID', 'link']).size().unstack(fill_value=0)
#link  F-F  F-M  F-P  M-F  M-P  P-F  P-M
#ID
#1       0    2    0    0    2    2    0
#2       1    1    0    2    1    0    1
#3       0    0    2    2    0    1    1
#4       0    1    1    2    0    1    0

df.groupby(['ID', 'link']).Duration.mean().unstack()
#link    F-F    F-M    F-P    M-F    M-P    P-F    P-M
#ID
#1       NaN   45.5    NaN    NaN   76.5  197.0    NaN
#2     122.0   59.0    NaN   76.5  153.0    NaN   61.0
#3       NaN    NaN   45.5  244.0    NaN   90.0   61.0
#4       NaN   92.0   61.0  227.5    NaN   61.0    NaN
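
The snippets above assume `df` already exists. A self-contained sketch of the same shift + where idea follows, run on a handful of rows copied from the question (the first four rows of ID 1 and the first three of ID 2). The names `df_small`, `prev_status`, `same_id`, `counts` and `means` are introduced here only for illustration.

```python
import pandas as pd

# Subset of the question's table: ID 1 rows 0-3 and ID 2 rows 7-9.
df_small = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "Status": ["F", "M", "P", "F", "M", "F", "F"],
    "Duration": [float("nan"), 30.0, 92.0, 243.0, float("nan"), 92.0, 122.0],
})

prev_status = df_small["Status"].shift()
same_id = df_small["ID"].eq(df_small["ID"].shift())

# Keep the previous status only where the previous row has the same ID;
# elsewhere the link becomes NaN and groupby drops it by default.
df_small["link"] = prev_status.where(same_id) + "-" + df_small["Status"]

counts = df_small.groupby(["ID", "link"]).size().unstack(fill_value=0)
means = df_small.groupby(["ID", "link"])["Duration"].mean().unstack()

print(df_small)
print(counts)
print(means)
```

The first row of each ID ends up with a missing link, which is why the M that opens ID 2 is not counted as an F-M transition from the last row of ID 1.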

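To make sure every link appears in the output, even one that never occurs in the data, use reindex after unstacking. You can build the list of columns yourself, or use unique to collect the Status values that are present.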

u = df.Status.unique()
cols = [f'{x}-{y}' for x in u for y in u]
#['F-F', 'F-M', 'F-P', 'M-F', 'M-M', 'M-P', 'P-F', 'P-M', 'P-P']

df.groupby(['ID', 'link']).Duration.mean().unstack().reindex(cols, axis=1)

#link    F-F    F-M    F-P    M-F    M-M    M-P    P-F    P-M    P-P
#ID
#1       NaN   45.5    NaN    NaN    NaN   76.5  197.0    NaN    NaN
#2     122.0   59.0    NaN   76.5    NaN  153.0    NaN   61.0    NaN
#3       NaN    NaN   45.5  244.0    NaN    NaN   90.0   61.0    NaN
#4       NaN   92.0   61.0  227.5    NaN    NaN   61.0    NaN    NaN
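
The question asks for columns named Nof_... and Avg_.... One possible way to get those names is to chain add_prefix after reindex. The sketch below uses a small made-up frame `toy` whose `link` column is already filled in (it is not the question's full table), and the names `statuses`, `nof` and `avg` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data with the link column already built.
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "link": ["F-M", "M-P", "M-F", "F-F", "F-M"],
    "Duration": [30.0, 92.0, 92.0, 122.0, 59.0],
})

# Every ordered pair of statuses, including pairs that never occur.
statuses = ["F", "M", "P"]
cols = [f"{x}-{y}" for x in statuses for y in statuses]

grouped = toy.groupby(["ID", "link"])

# Counts: missing links are filled with 0 so the column stays integer.
nof = (grouped.size().unstack(fill_value=0)
       .reindex(columns=cols, fill_value=0)
       .add_prefix("Nof_"))

# Averages: missing links stay NaN, since there is nothing to average.
avg = (grouped["Duration"].mean().unstack()
       .reindex(columns=cols)
       .add_prefix("Avg_"))

print(nof)
print(avg)
```

This produces a P-P column as well, which the desired output in the question leaves out; it can be dropped by omitting 'P-P' from `cols`.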

Regarding python - Groupby count of sequences and duration of sequences, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57186940/
