gpt4 book ai didi

python-3.x - 使用条件回填 Pandas 数据框列

转载 作者:行者123 更新时间:2023-12-01 13:16:40 24 4
gpt4 key购买 nike

我有一个包含 5000 万条记录的 Pandas 数据框,我想做的是根据条件回填。正如我们所看到的,名称 800A 和 Barber 的时间戳对齐,所以我假设数据属于相同的名称,这只是记录数据时的错误。米娅的名字也是如此。

这只是样本数据。

我的数据框看起来像这样。
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE

01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE

01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE

01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE

01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE

01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE

03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE

03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE

03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE

03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE

03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

现在我正在尝试回填记录编号(rec_num)列,直到它在 x_inc 和 vs_inc 列中都映射了 bool 条件 False False 。

实际输出:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

预期输出:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE

01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE

01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE

03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE

03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

我正在使用 df['rec_num'].fillna(method='bfill')但它完全填满,这不是我理想的解决方案。如果我能得到关于这个问题的任何建议(或者如果有更好的方法),我将不胜感激。提前致谢。

最佳答案

使用 bool 掩码和 np.where() 你可以用这个:

cond=(df.x_inc == False) & (df.vs_inc == False) #creates a boolean mask where both columns are false
df['new_rec']=np.where(~cond,df.rec_num.bfill(),df.rec_num) #does a backfill on where condition is not met
print(df)

注意:您可以将值重新分配给名为 rec_num 的旧列。而不是创建一个新列。我加了,这样你就可以比较了。这也应该是自矢量化以来最快的方法
    datetime            name    dischargeDate       HR      Sp      x_inc   vs_inc  rec_num     new_rec
0 2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0 98.0 False True 6458445.0 6458445.0
1 2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0 97.0 False True 6458445.0 6458445.0
2 2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN NaN False False NaN NaN
3 2029-01-01 17:58:45 800A 2029-01-14 17:59:10 NaN NaN False False NaN NaN
4 2029-01-01 17:58:50 800A 2029-01-14 17:59:10 139.0 NaN False True NaN 7192783.0
5 2029-01-01 17:58:55 800A 2029-01-14 17:59:10 138.0 NaN False True NaN 7192783.0
...........................................................
...........................................................
....................................................
.....................................

关于python-3.x - 使用条件回填 Pandas 数据框列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54191998/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com