
python - Find duplicate values between groups in a Pandas GroupBy

Repost. Author: 行者123. Updated: 2023-12-02 08:53:57

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})
df.groupby(['A', 'B']).sum()


How can I find out whether a value in column C is also repeated in other groups? (Here 12 is repeated in two groups.)

Best answer

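The idea is to convert the MultiIndex into a 3-column DataFrame with reset_index, reshape it with DataFrame.pivot, and drop the rows of non-duplicated values with DataFrame.dropna, so that the common values remain in the index: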

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})
df = df.groupby(['A', 'B']).sum()

# keyword arguments are required by DataFrame.pivot in pandas 2.0 and later
common = df.reset_index().pivot(index='C', columns='A', values='B').dropna().index
print(common)
Int64Index([12], dtype='int64', name='C')
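
To see why only 12 survives, it can help to look at the intermediate pivoted table before dropna is applied. The following is a minimal sketch on the same data; the names summed and pivoted are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [3, 4, 5, 8, 10, 12, 14, 12],
})
summed = df.groupby(['A', 'B']).sum()

# One row per summed value of C, one column per value of A.
# Each cell holds the B label of the group that produced that sum, or NaN.
pivoted = summed.reset_index().pivot(index='C', columns='A', values='B')
print(pivoted)

# dropna() keeps only rows without any NaN, i.e. sums present under every A.
print(pivoted.dropna().index.tolist())  # [12]
```

Two consequences follow from this shape. With more than two values in A, dropna keeps only sums that occur under all of them, not merely under two. And DataFrame.pivot raises a ValueError when the same sum occurs twice under one value of A, which is presumably why the later variant switches to pivot_table.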

Then, to filter the original data, use boolean indexing:

df = df[df['C'].isin(common)]
print(df)
            C
A   B
bar two    12
foo three  12
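
If "other groups" is taken to mean any other (A, B) pair rather than a different value of A, a shorter route may be enough. This is a sketch under that assumption, not the method from the answer; it relies on Series.duplicated with keep=False, and the names summed and shared are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [3, 4, 5, 8, 10, 12, 14, 12],
})
summed = df.groupby(['A', 'B']).sum()

# keep=False flags every occurrence of a repeated sum, not only the later ones
shared = summed[summed['C'].duplicated(keep=False)]
print(shared)
```

Unlike the pivot approach, this also reports a sum that repeats within a single value of A, so it only matches the answer when that case cannot occur or is wanted.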

If you need the values that are repeated in at least 2 groups, a solution is:

print(df)
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   3
3  bar  three   8
4  foo    two  14
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8

df = df.groupby(['A', 'B']).sum()
print(df)
            C
A   B
bar one     4
    three   8 <- duplicate, in group bar, three
    two    12 <- duplicate, in group bar, two
foo one    17 <- 17 is duplicated only within foo, so omitted
    three  12 <- duplicate, in group foo, three
    two    17 <- 17 is duplicated only within foo, so omitted
xxx yyy     8 <- duplicate, in group xxx, yyy
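# for each summed value of C, count how many distinct values of A contain it
common1 = (df.reset_index()
             .pivot_table(index='C', columns='A', values='B', aggfunc='size')
             .notna()
             .sum(axis=1)
          )
# keep only the values found under more than one value of A
common1 = common1.index[common1.gt(1)]
print(common1)
Int64Index([8, 12], dtype='int64', name='C')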

df1 = df[df['C'].isin(common1)]
print(df1)
            C
A   B
bar three   8
    two    12
foo three  12
xxx yyy     8
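
The same count of "how many values of A contain this sum" can also be obtained without pivoting. This is a minimal alternative sketch, not the code from the answer; it rebuilds the 9-row sample from the printed table, and the names summed, groups_per_value, common_values and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'xxx'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'yyy'],
    'C': [3, 4, 3, 8, 14, 12, 14, 12, 8],
})
summed = df.groupby(['A', 'B']).sum()

# For each summed value, count the distinct values of A it appears under
groups_per_value = summed.reset_index().groupby('C')['A'].nunique()

# Keep the sums that appear under more than one value of A
common_values = groups_per_value.index[groups_per_value.gt(1)]

result = summed[summed['C'].isin(common_values)]
print(list(common_values))  # [8, 12]
print(result)
```

Here 17 is dropped because both of its occurrences fall under foo, which matches the annotated output above.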

Regarding python - Find duplicate values between groups in a Pandas GroupBy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59807541/
