python - Collapse a DataFrame on duplicate column values and remove NaN values

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:51:14

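I'm working with a patient database that has multiple lab values, where each lab result gets its own row even when several fall on the same date. I want to collapse the rows on duplicate dates for each patient, so that each date has a single row containing all the lab results from that day.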

I tried various groupby() and pd.merge() approaches, but none of them worked.

Toy example:

import pandas as pd
import numpy as np
PID = [1, 1, 1, 2, 2, 2]
ALC = [200, np.nan, np.nan, 300, np.nan, np.nan]
WBC = [np.nan, 1000, np.nan, np.nan, 2000, np.nan]
per_neut = [np.nan, np.nan, 0.64, np.nan, np.nan, 0.77]
date = ['11/1/18', '11/2/18', '11/2/18', '1/11/04',
'1/11/04','1/11/04']

prac_dict = {'PID':PID, 'date':date, 'ALC':ALC, 'WBC':WBC,
'per_neut':per_neut}
pract_df = pd.DataFrame(prac_dict)

This is what I have:

print(pract_df)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0       NaN
2    1  11/2/18    NaN     NaN      0.64
3    2  1/11/04  300.0     NaN       NaN
4    2  1/11/04    NaN  2000.0       NaN
5    2  1/11/04    NaN     NaN      0.77

This is what I want:

   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0      0.64
2    2  1/11/04  300.0  2000.0      0.77

Suggestions are very welcome!

Best answer

If you need the first non-missing value of each column within each group, use GroupBy.first:

df = pract_df.groupby(['PID','date'], as_index=False).first()
print(df)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0      0.64
2    2  1/11/04  300.0  2000.0      0.77
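
GroupBy.first only keeps every value when each group holds at most one non-missing value per column. As a minimal sketch (the names `counts` and `lossless` are illustrative and not part of the original answer), that condition can be checked by counting non-missing values per group:

```python
import numpy as np
import pandas as pd

# Same toy data as in the question.
pract_df = pd.DataFrame({
    'PID': [1, 1, 1, 2, 2, 2],
    'date': ['11/1/18', '11/2/18', '11/2/18', '1/11/04', '1/11/04', '1/11/04'],
    'ALC': [200, np.nan, np.nan, 300, np.nan, np.nan],
    'WBC': [np.nan, 1000, np.nan, np.nan, 2000, np.nan],
    'per_neut': [np.nan, np.nan, 0.64, np.nan, np.nan, 0.77],
})

# count() ignores NaN, so this is the number of real values
# for every (group, column) pair.
counts = pract_df.groupby(['PID', 'date']).count()

# True when no (group, column) pair holds more than one value,
# i.e. first() cannot discard anything.
lossless = bool((counts <= 1).all().all())
print(lossless)  # True for this toy data
```

If the check prints False, some group has competing values in a column and an explicit aggregate function is probably the safer choice.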

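However, if a group holds more than one non-missing value in a column, such as the 50 in the ALC column of the last group, you need to choose an aggregate function such as sum or mean. With first, the second value is lost: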

PID = [1, 1, 1, 2, 2, 2]
ALC = [200, np.nan, np.nan, 300, np.nan, 50]
WBC = [np.nan, 1000, np.nan, np.nan, 2000, np.nan]
per_neut = [np.nan, np.nan, 0.64, np.nan, np.nan, 0.77]
date = ['11/1/18', '11/2/18', '11/2/18', '1/11/04',
'1/11/04','1/11/04']

prac_dict = {'PID':PID, 'date':date, 'ALC':ALC, 'WBC':WBC,
'per_neut':per_neut}
pract_df = pd.DataFrame(prac_dict)
print(pract_df)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0       NaN
2    1  11/2/18    NaN     NaN      0.64
3    2  1/11/04  300.0     NaN       NaN
4    2  1/11/04    NaN  2000.0       NaN
5    2  1/11/04   50.0     NaN      0.77
df1 = pract_df.groupby(['PID','date'], as_index=False).sum(min_count=1)
print(df1)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0      0.64
2    2  1/11/04  350.0  2000.0      0.77
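
The min_count=1 argument matters here. As a small sketch on the same data (the names `default_sum` and `guarded_sum` are illustrative), a plain sum() turns a group with no values at all into 0, while min_count=1 keeps it as NaN:

```python
import numpy as np
import pandas as pd

# Toy data from the answer, with the extra 50 in the ALC column.
pract_df = pd.DataFrame({
    'PID': [1, 1, 1, 2, 2, 2],
    'date': ['11/1/18', '11/2/18', '11/2/18', '1/11/04', '1/11/04', '1/11/04'],
    'ALC': [200, np.nan, np.nan, 300, np.nan, 50],
    'WBC': [np.nan, 1000, np.nan, np.nan, 2000, np.nan],
    'per_neut': [np.nan, np.nan, 0.64, np.nan, np.nan, 0.77],
})

grouped = pract_df.groupby(['PID', 'date'], as_index=False)

default_sum = grouped.sum()             # min_count defaults to 0
guarded_sum = grouped.sum(min_count=1)  # needs at least one real value

# Row 1 is patient 1 on 11/2/18, who has no ALC value at all.
print(default_sum.loc[1, 'ALC'])  # 0.0, looks like a real measurement
print(guarded_sum.loc[1, 'ALC'])  # nan, the missing value is preserved
```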

df2 = pract_df.groupby(['PID','date'], as_index=False).mean()
print(df2)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0      0.64
2    2  1/11/04  175.0  2000.0      0.77

df3 = pract_df.groupby(['PID','date'], as_index=False).first()
print(df3)
   PID     date    ALC     WBC  per_neut
0    1  11/1/18  200.0     NaN       NaN
1    1  11/2/18    NaN  1000.0      0.64
2    2  1/11/04  300.0  2000.0      0.77
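
If different labs call for different rules, one option is to pass GroupBy.agg a dictionary with one function per column. The sketch below is only an illustration: the choice of mean for ALC and first for the other columns is an assumption, not something the original answer prescribes, and `df4` is a made-up name.

```python
import numpy as np
import pandas as pd

# Toy data from the answer, with the extra 50 in the ALC column.
pract_df = pd.DataFrame({
    'PID': [1, 1, 1, 2, 2, 2],
    'date': ['11/1/18', '11/2/18', '11/2/18', '1/11/04', '1/11/04', '1/11/04'],
    'ALC': [200, np.nan, np.nan, 300, np.nan, 50],
    'WBC': [np.nan, 1000, np.nan, np.nan, 2000, np.nan],
    'per_neut': [np.nan, np.nan, 0.64, np.nan, np.nan, 0.77],
})

# Hypothetical per-column rules: average repeated ALC values,
# keep the first non-missing value for the other labs.
df4 = pract_df.groupby(['PID', 'date'], as_index=False).agg(
    {'ALC': 'mean', 'WBC': 'first', 'per_neut': 'first'}
)
print(df4)
```

Here the last group gets ALC 175.0 (the mean of 300 and 50), while WBC and per_neut keep their single values.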

Regarding "python - Collapse a DataFrame on duplicate column values and remove NaN values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56302656/
