
python - Why doesn't this deduplication code work when I chain it together?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:32:49

I want to select the duplicate rows in this DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

print(df)

  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

This approach seems to work fine:

df = df.dropna(subset=['firstname', 'lastname', 'email'])

df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]

print(df)

  firstname  lastname  email
1   Bar Bar       Bar    Bar
3   Bar Bar       Bar    Bar

But if I chain the operations, it fails:

dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

df = df[dupes]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Should I generally stay away from chaining like this and keep things simple? What is going on here?

Best answer

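This is expected.

The problem with the second approach is that dropna has already removed rows 0 and 2, so the boolean Series dupes is indexed by 1, 3, 4, 5, 6 while the original df still has index 0 through 6. Boolean indexing requires the mask and the indexed object to share the same index, so pandas raises the error.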

print(df)
  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

print(dupes)
1 True
3 True
4 False
5 False
6 False
dtype: bool
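
To make the mismatch concrete, the sketch below rebuilds the question's frame and compares the two indexes directly. The name cols is introduced here only for illustration and does not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

# cols is a helper name assumed for this sketch
cols = ['firstname', 'lastname', 'email']

# The mask is computed on the frame returned by dropna, not on df itself
dupes = df.dropna(subset=cols).duplicated(subset=cols, keep=False)

print(df.index.tolist())     # [0, 1, 2, 3, 4, 5, 6]
print(dupes.index.tolist())  # [1, 3, 4, 5, 6]

# Labels present in df but missing from the mask: the rows dropna removed
print(df.index.difference(dupes.index).tolist())  # [0, 2]
```

Because labels 0 and 2 have no entry in dupes, df[dupes] has nothing to align those rows against.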

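In the first example, you build the mask from the already-filtered data and apply it to that same data, so the indexes match and it works: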

df = df.dropna(subset=['firstname', 'lastname', 'email'])
print(df)
  firstname  lastname     email
1   Bar Bar       Bar       Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com

print(df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
1 True
3 True
4 False
5 False
6 False
dtype: bool


df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]
print(df)
  firstname  lastname  email
1   Bar Bar       Bar    Bar
3   Bar Bar       Bar    Bar

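A possible solution is to use Series.reindex on the mask from the chained version, filling the missing labels with False so that it aligns with the original df: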

dupes1 = dupes.reindex(df.index, fill_value=False)
print(dupes1)
0 False
1 True
2 False
3 True
4 False
5 False
6 False
dtype: bool

df = df[dupes1]
print(df)
  firstname  lastname  email
1   Bar Bar       Bar    Bar
3   Bar Bar       Bar    Bar
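
As for whether chaining has to be avoided: one alternative, not part of the answer above, is to pass a callable to .loc. The callable receives the frame produced by the previous step in the chain, so the mask is computed on, and applied to, the same filtered frame. This is a minimal sketch; cols and result are names assumed here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

# cols and result are helper names assumed for this sketch
cols = ['firstname', 'lastname', 'email']

# The lambda is called with the output of dropna, so the boolean mask
# and the frame it indexes always share the same index.
result = (df.dropna(subset=cols)
            .loc[lambda d: d.duplicated(subset=cols, keep=False)])

print(result.index.tolist())  # [1, 3]
```

Unlike the reindex approach, this never touches the original df's index, which may be preferable when the rows removed by dropna are not needed afterwards.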

Regarding "python - Why doesn't this deduplication code work when I chain it together?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56970949/
