
python - Remove duplicate values that occur more than N times

Reposted. Author: 太空宇宙. Updated: 2023-11-04 07:15:34

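I have a DataFrame with duplicate values in the "lid" column. Using Pandas, I want to drop every row whose "lid" value occurs more than 2 times. Here is the original table: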

entity  pnb head#   state   lid
ABB001 A03 3 DOWN A
ABB001 A03 3 DOWN A
ABB001 A03 3 DOWN A
ABB002 A02 4 DOWN B
ABB002 A02 4 DOWN B
ABB002 A02 2 DOWN C
ABB002 A02 4 DOWN D
ABB002 A02 4 DOWN E
ABB002 A02 4 DOWN E
ABB002 A02 4 DOWN E

The desired result:

entity  pnb head#   state   lid
ABB002 A02 4 DOWN B
ABB002 A02 4 DOWN B
ABB002 A02 2 DOWN C
ABB002 A02 4 DOWN D
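
To try the snippets in the answer, the frame first has to exist. A minimal sketch that rebuilds the question's table, assuming the column names in its header row and integer values for head#:

```python
import pandas as pd

# Rebuild the question's table; dtypes are an assumption (head# as int).
df = pd.DataFrame({
    'entity': ['ABB001'] * 3 + ['ABB002'] * 7,
    'pnb': ['A03'] * 3 + ['A02'] * 7,
    'head#': [3, 3, 3, 4, 4, 2, 4, 4, 4, 4],
    'state': ['DOWN'] * 10,
    'lid': list('AAABBCDEEE'),
})

# Occurrences of each lid; A and E appear 3 times, so their rows should go.
counts = df['lid'].value_counts()
```

With the default RangeIndex, the rows that survive are the ones labelled 3 to 6, which matches the index shown in the answer's output.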

Best answer

Option 0
Use value_counts and isin

df[~df.lid.isin(df.lid.value_counts().loc[lambda x: x > 2].index)]

entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
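
The one-liner above packs three steps together. Spelled out, and with the threshold as a parameter, it might look like the sketch below; the helper name drop_frequent and its arguments are illustrative, not part of the original answer:

```python
import pandas as pd

def drop_frequent(df, col, n):
    """Keep only rows whose value in `col` occurs at most `n` times."""
    counts = df[col].value_counts()       # occurrences of each distinct value
    too_many = counts[counts > n].index   # values seen more than n times
    return df[~df[col].isin(too_many)]    # negate the membership mask

# Reduced version of the question's data: only the lid column.
df = pd.DataFrame({'lid': list('AAABBCDEEE')})
result = drop_frequent(df, 'lid', 2)
```

Here result keeps the rows labelled 3 to 6, with lid values B, B, C, D.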

Option 1
Better: use np.in1d and pd.factorize

Implementation
lids = df.lid.values
f, u = pd.factorize(df.lid.values)
df[np.in1d(lids, u[np.bincount(f) <= 2])]

entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D

Option 2
Use np.bincount and pd.factorize

f, u = pd.factorize(df.lid)
df[np.bincount(f)[f] <= 2]

entity pnb head# state lid
3 ABB002 A02 4 DOWN B
4 ABB002 A02 4 DOWN B
5 ABB002 A02 2 DOWN C
6 ABB002 A02 4 DOWN D
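
The expression np.bincount(f)[f] is compact, so it may help to look at the intermediate arrays. A small sketch on the lid column alone (the variable names per_code, per_row and mask are mine):

```python
import numpy as np
import pandas as pd

lid = pd.Series(list('AAABBCDEEE'))

# f holds one integer code per row, assigned in order of first appearance;
# u holds the distinct values, so u[f] reproduces the column.
f, u = pd.factorize(lid)       # f: [0 0 0 1 1 2 3 4 4 4], u: A B C D E

per_code = np.bincount(f)      # count per code:  [3 2 1 1 3]
per_row = per_code[f]          # count per row:   [3 3 3 2 2 1 1 3 3 3]
mask = per_row <= 2            # True for rows whose lid occurs at most twice
```

Indexing the per-code counts with f is what maps them back onto the rows, so mask has one entry per row and can index the frame directly.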

For a fun demonstration, here is a highlight of what @cᴏʟᴅsᴘᴇᴇᴅ and I discussed in the comments.

Love the bincount one. There should be a np.unique one too, somewhere. – cᴏʟᴅsᴘᴇᴇᴅ

Yes there is. However, I don't use np.unique because @Jeff informed me that np.unique sorts when you grab counts or index or inverse. pd.factorize does not and is O(n). I've since validated that information. – piRSquared
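
The ordering difference mentioned in that comment is easy to see on a tiny unsorted array. This is only an illustration of the output order, not a measurement of the cost; the sample values are made up:

```python
import numpy as np
import pandas as pd

values = np.array(['C', 'A', 'B', 'A', 'C'], dtype=object)

# pd.factorize numbers the values in order of first appearance.
codes_f, uniques_f = pd.factorize(values)

# np.unique returns the distinct values sorted, with codes to match.
uniques_u, codes_u = np.unique(values, return_inverse=True)

# Either pair reconstructs the input, so np.bincount works on both.
same = (uniques_f[codes_f] == values).all() and (uniques_u[codes_u] == values).all()
```

factorize yields uniques C, A, B with codes 0, 1, 2, 1, 0, while np.unique yields A, B, C with codes 2, 0, 1, 0, 2.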

Timing tests

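from timeit import timeit

import numpy as np
import pandas as pd

# df is the 10-row frame from the question

def bincount_factorize(df):
    # Count each factorized code, then map the counts back onto the rows
    f, u = pd.factorize(df.lid.values)
    return df[np.bincount(f)[f] <= 2]

def bincount_unique(df):
    # Same idea, but with np.unique, which sorts to build the inverse codes
    u, f = np.unique(df.lid.values, return_inverse=True)
    return df[np.bincount(f)[f] <= 2]

def in1d_factorize(df):
    # Keep rows whose lid is among the values seen at most twice
    lids = df.lid.values
    f, u = pd.factorize(df.lid.values)
    return df[np.in1d(lids, u[np.bincount(f) <= 2])]

def transform(df):
    # Broadcast each group's size to its rows and filter on it
    return df[df.groupby('lid')['lid'].transform('size') <= 2]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns=['bincount_factorize', 'bincount_unique',
             'in1d_factorize', 'transform'],
    dtype=float
)

# Time every function on the frame repeated i times (10 * i rows)
for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)

# Divide each row by its minimum, so 1.0 marks the fastest method per row
res.div(res.min(1), 0)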

         bincount_factorize  bincount_unique  in1d_factorize  transform
10                 1.421827         1.000000        1.119577   3.751167
30                 1.008412         1.037297        1.000000   3.072631
100                1.000000         1.531300        1.028267   3.304560
300                1.000000         2.666583        1.182812   3.637235
1000               1.065213         5.563098        1.000000   2.556469
3000               1.024658        10.480027        1.000000   2.238765
10000              1.073403        14.716801        1.000000   1.574780
30000              1.000000        16.387130        1.053180   1.494161
100000             1.000000        18.533078        1.003031   1.369867
300000             1.078129        20.183122        1.000000   1.530698
1000000            1.166800        24.571463        1.000000   1.670423

res.plot(loglog=True)
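
The transform candidate in the timings is never shown on its own in the answer. A minimal sketch of what it computes, again on the lid column only (the names sizes and kept are mine):

```python
import pandas as pd

df = pd.DataFrame({'lid': list('AAABBCDEEE')})

# transform('size') returns one value per row: the size of that row's group.
sizes = df.groupby('lid')['lid'].transform('size')

# Filter on the per-row group size, exactly like the bincount mask.
kept = df[sizes <= 2]
```

It is the pure-pandas counterpart of np.bincount(f)[f]. In the table above it is never the fastest option, but the gap narrows as the frame grows.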

Regarding "python - Remove duplicate values that occur more than N times", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48275775/
