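I have the data frame below. I want to drop the duplicate rows, but append the values in column E from the dropped rows to the row that is kept.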
import pandas as pd
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,6,7],
                    'B' : [1,1,3,5,0,0,np.nan,9,0,0],
                    'C' : ['AA1233445','AA1233445', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Allign','Hello','Ugly','Appreciate','Undo','Testing','Unicycle','Pharma','Unicorn',]})
print(dfp)
I am grabbing all of the duplicates:
df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 1.0 AA1233445 123456.0 Allign
2 3.0 3.0 rmacy 1234567.0 Hello
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Testing
and I want my result to be:
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
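I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to get the first occurrences, but I have not been able to set the values in column E so that they include the other duplicated values.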
I was thinking I need to try something like this:
df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']
But my output is:
A B C D E
0 NaN 1.0 AA1233445 123456.0 AssignAssign
2 3.0 3.0 rmacy 1234567.0 HelloHello
4 5.0 0.0 Ab123455 12345.0 AppreciateAppreciate
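I'm stumped. Am I overcomplicating this? How can I get the output I'm looking for, so that I can later drop all duplicates except the first but "save" the dropped values in column E?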
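Define the functions to use in agg and apply them within a groupby. groupby drops NaN keys by default, so to make it work with NaN I convert A to strings first and then convert back to float.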
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values',
    inplace=False
)
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 4.0 5.0 Idaho Rx 12345678.0 Ugly
3 5.0 0.0 Ab123455 12345.0 Appreciate Undo
4 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
5 6.0 0.0 RX12345 1234567.0 Pharma
6 7.0 0.0 USA Pharma NaN Unicorn
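On pandas 1.1 or later, groupby accepts dropna=False, which keeps NaN keys as their own group, so the string round trip may not be needed. Here is a minimal sketch under that assumption. The names agg_spec and result are mine, a lambda stands in for ' '.join, and the position of the NaN group in the output may differ between pandas versions.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Join the strings in E; take the first non-null value for every other column.
agg_spec = {c: (lambda s: ' '.join(s)) if c == 'E' else 'first'
            for c in ['B', 'C', 'D', 'E']}

# dropna=False (assumes pandas >= 1.1) keeps the NaN key as a group,
# so A stays float and no conversion back is required.
result = dfp.groupby('A', sort=False, dropna=False).agg(agg_spec).reset_index()
print(result)
```

Note that 'first' returns the first non-null value in each group, which is why B is 3.0 for the A == 3 group even though one of its rows has NaN in B.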
Limiting it to only the duplicated rows:
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)
d2
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 5.0 0.0 Ab123455 12345.0 Appreciate Undo
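As for why the attempt in the question produced 'AssignAssign': adding two Series aligns them on index labels, so label 0 is paired with label 0 and each string is concatenated with itself. If the original row labels (0, 2, 4) should be kept, as in the desired output, one possible variation is to write the joined string onto every row of a group with transform and then drop the later duplicates. This is a sketch, not the only way to do it; the names dups, key and kept are mine.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Keep only rows whose A value occurs more than once (NaN counts as equal).
dups = dfp[dfp.duplicated('A', keep=False)].copy()

# Group on the string form of A so the NaN rows form a group too.
key = dups['A'].astype(str)

# transform broadcasts the joined string back to every row of its group,
# so the original index labels are preserved.
dups['E'] = dups.groupby(key, sort=False)['E'].transform(lambda s: ' '.join(s))

# Keep the first row of each group; it now carries all of the E values.
kept = dups.drop_duplicates('A', keep='first')
print(kept)
```

Unlike the agg version, this keeps columns B, C and D exactly as they appear in the first row of each group rather than taking the first non-null value.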