
python - A better strategy for removing all duplicates from a pandas.DataFrame?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:57:48

Does anyone know a better strategy for removing ALL duplicates from a pandas.DataFrame?

I know about df.drop_duplicates(); see the example below:

In [340]: import pandas as pd, string, random

In [341]: a = [''.join([random.choice(string.ascii_letters+string.digits) for _ in range(4)]) for _ in range(5)]

In [342]: b = [''.join([random.choice(string.digits) for _ in range(4)]) for i in range(5)]

In [343]: df1 = pd.DataFrame([a,b],index=list('ab')).T

In [344]: df1 = pd.concat([df1, df1.loc[1:3,:]])  # DataFrame.append was removed in pandas 2.0

In [345]: df1.index = range(len(df1))

In [346]: df1 = pd.concat([df1, df1.loc[1:3,:]])

In [347]: df1
Out[347]:
      a     b
0  r4fb  4179
1  sv5e  8092
2  Oyeh  8788
3  fAdu  4018
4  PxKX  2818
5  sv5e  8092
6  Oyeh  8788
7  fAdu  4018
1  sv5e  8092
2  Oyeh  8788
3  fAdu  4018

In [348]: df1.drop_duplicates()
Out[348]:
      a     b
0  r4fb  4179
1  sv5e  8092
2  Oyeh  8788
3  fAdu  4018
4  PxKX  2818

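Note that this does not remove all duplicates: it drops every later copy of a non-unique row but leaves the first occurrence in place...

My current strategy and the desired result are as follows: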
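To make the behaviour described above concrete, here is a minimal sketch on a small hand-written frame (the data and the names `first_kept` and `last_kept` are illustrative, not from the original session). It assumes a pandas release that has the `keep` argument on `drop_duplicates`; in both settings shown, one copy of the repeated row survives.

```python
import pandas as pd

# Illustrative data: the row ("sv5e", "8092") appears at positions 1 and 3.
df = pd.DataFrame({
    "a": ["r4fb", "sv5e", "PxKX", "sv5e"],
    "b": ["4179", "8092", "2818", "8092"],
})

# Default (keep="first"): the first occurrence stays, later copies are dropped.
first_kept = df.drop_duplicates()

# keep="last": the last occurrence stays, earlier copies are dropped.
last_kept = df.drop_duplicates(keep="last")

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

Either way the repeated row is still present once, which is why neither call alone leaves only the truly unique rows.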

In [349]: same_first = df1.duplicated(subset=['a','b'])

In [350]: same_last = df1.duplicated(subset=['a','b'], keep='last')  # take_last=True in older pandas

In [351]: rm_lst = ~(same_first|same_last)

In [352]: df1[rm_lst]
Out[352]:
      a     b
0  r4fb  4179
4  PxKX  2818

Note that now only the truly unique rows are left.

Is there a better way to get the same result, perhaps a one-liner I missed?

Thanks.

Best answer

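This does it in one line, though it is not very readable. Basically, it tests whether the value count for each column equals 1, filters the resulting list, and uses the index as a boolean index: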

In [260]:

df1[df1.a.isin((df1.a.value_counts()[df1.a.value_counts() == 1]).index) & (df1.b.isin((df1.b.value_counts()[df1.b.value_counts() == 1]).index))]
Out[260]:
      a     b
0  mlmv  3869
4  LPNz  4109
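One caveat worth illustrating: the expression above counts values in each column separately, which is not quite the same as counting whole rows. The sketch below uses a small made-up frame (the data and the names `per_column` and `per_row` are illustrative, not from the original post) in which every row is distinct as an (a, b) pair, yet each row shares a value with another row in at least one column. The row-level count via `groupby(...).transform("size")` is one possible way to count whole rows, offered as an assumption rather than as the answer's method.

```python
import pandas as pd

# Illustrative data: all four (a, b) pairs are distinct,
# but "x" repeats in column a and "3" repeats in column b.
df = pd.DataFrame({"a": ["x", "x", "y", "z"], "b": ["1", "2", "3", "3"]})

# Per-column test, same idea as the one-liner above.
a_counts = df["a"].value_counts()
b_counts = df["b"].value_counts()
per_column = (
    df["a"].isin(a_counts[a_counts == 1].index)
    & df["b"].isin(b_counts[b_counts == 1].index)
)

# Row-level test: how many times does each (a, b) pair occur?
per_row = df.groupby(["a", "b"])["a"].transform("size") == 1

print(df[per_column])  # rows whose values are unique in every column
print(df[per_row])     # rows that are unique as whole rows
```

On data like the question's, where duplicated rows repeat in both columns at once, the two tests agree; they can diverge when a single column value recurs across otherwise different rows.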

Breaking it down shows what is happening at each step:

In [261]:
# generate a series of the value counts
df1.a.value_counts()

Out[261]:
qPyr    3
ms7I    3
aOuL    3
LPNz    1
mlmv    1
dtype: int64

In [262]:
# we are only interested in the values that occur exactly once; the comparison yields a boolean mask that we use to index into the series above
df1.a.value_counts()[df1.a.value_counts() == 1]

Out[262]:
LPNz    1
mlmv    1
dtype: int64

In [264]:
# now use isin on column a, comparing its values against the index of the result above
df1.a.isin((df1.a.value_counts()[df1.a.value_counts() == 1]).index)
Out[264]:
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
1    False
2    False
3    False
Name: a, dtype: bool
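As a shorter alternative to the breakdown above, here is a minimal sketch that relies on the `keep=False` option of `duplicated` and `drop_duplicates`, which marks or drops every member of a duplicated group. It is not part of the original answer and assumes a pandas release that has the `keep` argument. The frame reuses the values printed in the question, and the names `base`, `unique_rows` and `masked` are illustrative.

```python
import pandas as pd

# Rebuild a frame shaped like the question's df1: five base rows,
# with rows 1..3 repeated twice more.
base = pd.DataFrame({
    "a": ["r4fb", "sv5e", "Oyeh", "fAdu", "PxKX"],
    "b": ["4179", "8092", "8788", "4018", "2818"],
})
df = pd.concat([base, base.loc[1:3], base.loc[1:3]], ignore_index=True)

# keep=False drops every row that has a duplicate, including the first copy.
unique_rows = df.drop_duplicates(keep=False)

# Equivalent boolean-mask form, restricted to columns a and b.
masked = df[~df.duplicated(subset=["a", "b"], keep=False)]

print(unique_rows)
```

Because `keep=False` compares whole rows (or the `subset` columns together), it matches the result of the two-mask strategy in the question without counting each column separately.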

Regarding "python - A better strategy for removing all duplicates from a pandas.DataFrame?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28239647/
