
python - Merging multiple pandas DataFrames - some with shared unique IDs, some with shared columns

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:21:39

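OK, I'm relatively new to pandas and Python, so I apologize if my question is very obvious. I have read all the pandas documentation on merge, join, and concat, read through all the similar questions on Stack Overflow and Scriptscoop, and watched hours of pandas tutorials on YouTube. But I still haven't figured out how to do what I want, which seems like it should be relatively easy in pandas.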

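Basically, I have one DataFrame for each type of positive bacterial result (E. coli, S. aureus, etc.). Each DataFrame contains a unique ID associated with the patient (Order), along with the result, the date, and the ward name. A patient may be positive for only one bacterium or for several, so some order numbers overlap between DataFrames and some appear only once.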

For example:

    Order  Test_EC  Results_EC        Date  Ward Name
0  K70201  E. coli    MODERATE  2014-01-02      North
1  K70277  E. coli    MODERATE  2014-01-02      North
2  K70205  E. coli         FEW  2014-01-02       West
3  K70818  E. coli    MODERATE  2014-01-03      South
4  K70202  E. coli         FEW  2014-01-03       West
5  K80070  E. coli        RARE  2014-01-03      North
6  K80666  E. coli         FEW  2014-01-03       East

    Order   Test_SA  Results_SA        Date  Ward Name
0  K80766  S.aureus        MANY  2014-01-01       West
1  K70201  S.aureus        MANY  2014-01-02      North
2  K70277  S.aureus        MANY  2014-01-02      North
3  K70205  S.aureus         FEW  2014-01-02       West
4  K90107  S.aureus         FEW  2014-01-06      North

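I want to build a master database keyed on the patient's order number, with associated columns for each positive test and result, plus the date and ward name. If a patient is positive for one test and negative for another, filling with NaN is fine. If two order numbers from different DataFrames match, they will by definition have the same date and ward name, so the test and result columns are essentially the only new information.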

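In short, I want to keep all the information contained in each table while showing all the data associated with each order number on a single row.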

I am hoping to get something like this:

    Order  Test_EC  Results_EC   Test_SA  Results_SA        Date  Ward Name
0  K70201  E. coli    MODERATE  S.aureus        MANY  2014-01-02      North
1  K70277  E. coli    MODERATE  S.aureus        MANY  2014-01-02      North
2  K70205  E. coli         FEW  S.aureus         FEW  2014-01-02       West
3  K70818  E. coli    MODERATE       NaN         NaN  2014-01-03      South
4  K70202  E. coli         FEW       NaN         NaN  2014-01-03       West
5  K80070  E. coli        RARE       NaN         NaN  2014-01-03      North
6  K80666  E. coli         FEW       NaN         NaN  2014-01-03       East
7  K80766      NaN         NaN  S.aureus        MANY  2014-01-01       West
8  K90107      NaN         NaN  S.aureus         FEW  2014-01-06      North

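As you can see, the resulting DataFrame is three rows shorter than the two inputs stacked together (9 rows instead of 12), because 3 patients had both E. coli and S. aureus. There are no duplicate values in the Order column, yet all the information is preserved.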

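I would also like to keep building this database by doing the same thing about twenty more times for different bacteria. The actual dataset has about 100,000 unique order numbers.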

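If I described every combination of the join, merge, and concat functions I have tried, and why each did not work, this post would be far too long. I know I am missing something obvious. Any ideas would be greatly appreciated!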

Best answer

It looks like you want an 'outer' merge?

In [154]: df1
Out[154]:
    Order  Test_EC  Results_EC        Date  Ward Name
0  K70201  E. coli    MODERATE  2014-01-02      North
1  K70277  E. coli    MODERATE  2014-01-02      North
2  K70205  E. coli         FEW  2014-01-02       West
3  K70818  E. coli    MODERATE  2014-01-03      South
4  K70202  E. coli         FEW  2014-01-03       West
5  K80070  E. coli        RARE  2014-01-03      North
6  K80666  E. coli         FEW  2014-01-03       East

In [155]: df2
Out[155]:
    Order   Test_SA  Results_SA        Date  Ward Name
0  K80766  S.aureus        MANY  2014-01-01       West
1  K70201  S.aureus        MANY  2014-01-02      North
2  K70277  S.aureus        MANY  2014-01-02      North
3  K70205  S.aureus         FEW  2014-01-02       West
4  K90107  S.aureus         FEW  2014-01-06      North

In [156]: df1.merge(df2, how='outer')
Out[156]:
    Order  Test_EC  Results_EC        Date  Ward Name   Test_SA  Results_SA
0  K70201  E. coli    MODERATE  2014-01-02      North  S.aureus        MANY
1  K70277  E. coli    MODERATE  2014-01-02      North  S.aureus        MANY
2  K70205  E. coli         FEW  2014-01-02       West  S.aureus         FEW
3  K70818  E. coli    MODERATE  2014-01-03      South       NaN         NaN
4  K70202  E. coli         FEW  2014-01-03       West       NaN         NaN
5  K80070  E. coli        RARE  2014-01-03      North       NaN         NaN
6  K80666  E. coli         FEW  2014-01-03       East       NaN         NaN
7  K80766      NaN         NaN  2014-01-01       West  S.aureus        MANY
8  K90107      NaN         NaN  2014-01-06      North  S.aureus         FEW
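
For the roughly twenty per-organism frames mentioned in the question, the same outer merge can be folded over a list. The following is a minimal sketch, not part of the original answer: the names `KEYS`, `merge_all`, `df_ec`, and `df_sa` are made up for illustration, and it assumes every frame carries the same `Order`, `Date`, and `Ward Name` columns with at most one row per order. Passing `on=` explicitly spells out the keys that `merge` would otherwise infer from the shared column names, and `validate="one_to_one"` asks pandas to raise an error if a key appears twice in either frame.

```python
from functools import reduce

import pandas as pd

# Columns shared by every per-organism frame (taken from the question's sample data).
KEYS = ["Order", "Date", "Ward Name"]

# The two sample frames from the question, rebuilt by hand.
df_ec = pd.DataFrame({
    "Order": ["K70201", "K70277", "K70205", "K70818", "K70202", "K80070", "K80666"],
    "Test_EC": ["E. coli"] * 7,
    "Results_EC": ["MODERATE", "MODERATE", "FEW", "MODERATE", "FEW", "RARE", "FEW"],
    "Date": ["2014-01-02", "2014-01-02", "2014-01-02", "2014-01-03",
             "2014-01-03", "2014-01-03", "2014-01-03"],
    "Ward Name": ["North", "North", "West", "South", "West", "North", "East"],
})
df_sa = pd.DataFrame({
    "Order": ["K80766", "K70201", "K70277", "K70205", "K90107"],
    "Test_SA": ["S.aureus"] * 5,
    "Results_SA": ["MANY", "MANY", "MANY", "FEW", "FEW"],
    "Date": ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-02", "2014-01-06"],
    "Ward Name": ["West", "North", "North", "West", "North"],
})


def merge_all(frames, keys=KEYS):
    """Outer-merge a list of frames pairwise on the shared key columns."""
    return reduce(
        # validate="one_to_one" raises MergeError if a key repeats on either side.
        lambda left, right: left.merge(right, on=keys, how="outer", validate="one_to_one"),
        frames,
    )


# With more organisms, append their frames to this list (hypothetical usage).
master = merge_all([df_ec, df_sa])
print(master)
```

Row order in the result can differ between pandas versions, so sort explicitly if the order matters. If the same order ever appeared with a different date or ward in two frames, merging on all three keys would yield two rows for it, so it may be worth checking `master["Order"].is_unique` afterwards.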

Regarding python - Merging multiple pandas DataFrames - some with shared unique IDs, some with shared columns, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33080119/
