
python - Pandas: drop duplicate rows when the reversed pair exists across two columns

Reposted. Author: 太空狗. Updated: 2023-10-30 02:45:25

I have a dataset with 2 columns that looks like this...

InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414

I'm using Pandas, and I want to remove rows that appear twice but are simply reversed, as below... going from this...

InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741

to this...

InteractorA InteractorB
AGAP002741 AGAP008026

because for all intents and purposes the two rows are the same.

Is there a built-in method to handle this?

Best answer

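I ended up writing a hacky script that iterates over the rows, pulls out the necessary pieces of data, checks whether the pair or its reverse has already appeared, and drops row indexes as needed.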

import pandas as pd

checklist = set()     # concatenated pairs seen so far, in both orientations
indexes_to_drop = []  # index labels of rows that repeat an earlier pair

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # Test each string separately: `(a or b) in checklist` only looks up `a`.
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    else:
        checklist.add(check_string)
        checklist.add(check_string_rev)

# iterrows() yields index labels, so drop by label directly.
no_dups = interactions.drop(indexes_to_drop)

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)
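The loop idea above can also be sketched without string concatenation, by keying each row on an unordered pair. This is only an illustration: the function name `drop_reversed_pairs`, its parameters, and the three-row sample frame are hypothetical and not part of the original answer.

```python
import pandas as pd


def drop_reversed_pairs(df, col_a="InteractorA", col_b="InteractorB"):
    """Keep the first occurrence of each unordered (col_a, col_b) pair."""
    seen = set()
    keep = []
    for a, b in zip(df[col_a], df[col_b]):
        key = frozenset((a, b))  # same key for (a, b) and (b, a)
        keep.append(key not in seen)
        seen.add(key)
    # A boolean list of the same length as the frame selects the kept rows.
    return df.loc[keep]


# Hypothetical sample: rows 1 and 2 are the same pair in reversed order.
sample = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})
result = drop_reversed_pairs(sample)
print(result)
```

Comparing pairs rather than joined strings avoids any chance that two different pairs concatenate to the same text, which could matter if identifiers vary in length.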

2017 edit: A few years on, with more experience, here is a more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
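If the helper column is not wanted in the result, one possible variant is to sort the two values within each row using NumPy and build a boolean mask from the sorted pairs. This is a sketch under assumptions, not the answer's own method: the shortened sample frame and the names `cols`, `sorted_pairs`, and `deduped` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 1 and 2 are the same pair in reversed order.
df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})

cols = ["InteractorA", "InteractorB"]
# Sort the two values within each row so (a, b) and (b, a) become identical.
sorted_pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
# duplicated() marks every repeat after the first occurrence of a sorted pair.
deduped = df[~sorted_pairs.duplicated()]
print(deduped)
```

The surviving rows keep their original column order and orientation, and no extra column is left on `df`.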

Regarding "python - Pandas: drop duplicate rows when the reversed pair exists across two columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24676705/
