
python - Pandas equality check is too slow to be usable

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 07:57:14

I need to find the records that changed from one DataFrame to another. A record counts as unchanged only if it matches on all columns.

One comes from an Excel file (new_df) and the other from a SQL query (sql_df). Each is roughly 20,000 rows by 39 columns. I thought this would be a good job for df.equals(other_df).

Currently I am using the following:

import pandas as pd
import numpy as np
new_df = pd.DataFrame({'ID' : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B' : [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                       'C' : [10, 0, 30, 50, 0, 0, 4, 10, 1, 3],
                       'D' : [1, 0, 3, 4, 0, 0, 7, 8, 0, 1],
                       'E' : ['Universtiy of New York', 'New Hampshire University', 'JMU', 'Oklahoma State', 'Penn State',
                              'New Mexico Univ', 'Rutgers', 'Indiana State', 'JMU', 'University of South Carolina']})

sql_df = pd.DataFrame({'ID' : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B' : [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                       'C' : [10, 0, 30, 50, 0, 0, 4, 10, 1, 0],
                       'D' : [5, 0, 3, 4, 0, 0, 7, 8, 0, 1],
                       'E' : ['Universtiy of New York', 'New Hampshire University', 'NYU', 'Oklahoma State', 'Penn State',
                              'New Mexico Univ', 'Rutgers', 'Indiana State', 'NYU', 'University of South Carolina']})

# creates an empty list to append to
differences = []
# for all the IDs in the dataframe that should not change, check if this record is the same in the database
# must use reset_index() so that equals() works as I expect it to
# if it is not the same, append to a list which has the Aspn ID that is failing, along with the columns that changed
for unique_id in new_df['ID'].tolist():
    # get the id from the list, and filter both sql and new dfs to this record
    if new_df.loc[new_df['ID'] == unique_id].reset_index(drop=True).equals(sql_df.loc[sql_df['ID'] == unique_id].reset_index(drop=True)) is False:
        bad_columns = []
        for column in new_df.columns.tolist():
            # if not the same above, check which column using the same logic
            if new_df.loc[new_df['ID'] == unique_id][column].reset_index(drop=True).equals(sql_df.loc[sql_df['ID'] == unique_id][column].reset_index(drop=True)) is False:
                bad_columns.append(column)
        differences.append([unique_id, bad_columns])

I later use differences and bad_columns for other tasks.

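I am hoping to avoid all these loops, since they are probably the cause of my performance problem. Currently 20,000 records take more than 5 minutes (this varies with hardware), which is terrible performance. I thought about concatenating all the columns of each row into one long string and comparing those, but that seems like another inefficient approach. What is a better way to solve this, and how can I avoid this messy append-to-an-empty-list solution?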

Best answer

In [26]: new_df.ne(sql_df)
Out[26]:
       B      C      D      E     ID
0  False  False   True  False  False
1  False  False  False  False  False
2  False  False  False   True  False
3  False  False  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
6   True  False  False  False  False
7  False  False  False  False  False
8  False  False  False   True  False
9  False   True  False  False  False
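
One caveat worth keeping in mind: `ne` pairs rows by index label, not by the `ID` column, so the comparison above is only meaningful when both frames list the same records in the same order. A minimal sketch of aligning on the key first, assuming `ID` is unique in each frame (the toy data below is hypothetical and not taken from the question):

```python
import pandas as pd

# Hypothetical toy frames: same IDs, but the rows arrive in a different order.
new_df = pd.DataFrame({'ID': [0, 1, 2], 'C': [10, 0, 30], 'E': ['a', 'b', 'c']})
sql_df = pd.DataFrame({'ID': [2, 0, 1], 'C': [30, 10, 5], 'E': ['c', 'a', 'b']})

# Index both frames by ID so rows are paired by key rather than by position.
left = new_df.set_index('ID')
right = sql_df.set_index('ID')

# Keep only IDs present on both sides and give both frames the same
# row order and column order before comparing.
common_ids = left.index.intersection(right.index)
left = left.loc[common_ids]
right = right.loc[common_ids, left.columns]

mask = left.ne(right)  # True where a cell differs
changed_ids = mask.index[mask.any(axis=1)].tolist()
print(changed_ids)
```

IDs that exist in only one of the two frames are dropped by the intersection here; whether they should instead be reported as differences depends on the use case.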

Show which columns differ:

In [27]: new_df.ne(sql_df).any(axis=0)
Out[27]:
B True
C True
D True
E True
ID False
dtype: bool

Show which rows differ:

In [28]: new_df.ne(sql_df).any(axis=1)
Out[28]:
0 True
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 True
9 True
dtype: bool
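
Row 6 is flagged above even though column B is NaN in both frames, because NaN never compares equal to itself under `ne`. If two missing values should count as a match (which is how `equals` treats them), one possible fix is to mask out the cells that are missing on both sides. A small sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: row 1 is NaN on both sides, row 2 really differs.
new_df = pd.DataFrame({'ID': [0, 1, 2], 'B': [1.0, np.nan, 9.0]})
sql_df = pd.DataFrame({'ID': [0, 1, 2], 'B': [1.0, np.nan, 8.0]})

naive = new_df.ne(sql_df)                      # NaN != NaN, so row 1 is flagged
both_missing = new_df.isna() & sql_df.isna()   # cells that are NaN in both frames
mask = naive & ~both_missing                   # treat NaN vs NaN as "no change"

changed_rows = mask.any(axis=1)
print(changed_rows.tolist())
```

With this mask only row 2 is reported as changed, while the naive comparison also reports row 1.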

Update:

Show the cells that differ:

In [86]: x = new_df.ne(sql_df)

In [87]: new_df[x].loc[x.any(axis=1)]
Out[87]:
    B    C    D    E  ID
0 NaN  NaN  1.0  NaN NaN
2 NaN  NaN  NaN  JMU NaN
6 NaN  NaN  NaN  NaN NaN
8 NaN  NaN  NaN  JMU NaN
9 NaN  3.0  NaN  NaN NaN

In [88]: sql_df[x].loc[x.any(axis=1)]
Out[88]:
    B    C    D    E  ID
0 NaN  NaN  5.0  NaN NaN
2 NaN  NaN  NaN  NYU NaN
6 NaN  NaN  NaN  NaN NaN
8 NaN  NaN  NaN  NYU NaN
9 NaN  0.0  NaN  NaN NaN
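
The question also asks for a `differences` list of `[ID, bad_columns]` pairs. One way to derive it from a boolean mask, sketched here on hypothetical toy data and assuming `ID` is unique and present in both frames, is to loop only over the rows that actually changed:

```python
import pandas as pd

# Hypothetical toy frames shaped like the ones in the question.
new_df = pd.DataFrame({'ID': [0, 1, 2, 3],
                       'C': [10, 0, 30, 3],
                       'D': [1, 0, 3, 4],
                       'E': ['JMU', 'x', 'y', 'z']})
sql_df = pd.DataFrame({'ID': [0, 1, 2, 3],
                       'C': [10, 0, 30, 0],
                       'D': [5, 0, 3, 4],
                       'E': ['NYU', 'x', 'y', 'z']})

left = new_df.set_index('ID')
right = sql_df.set_index('ID')

# Cell-level mask; cells that are NaN on both sides are not counted as changes.
mask = left.ne(right) & ~(left.isna() & right.isna())

# Restrict to changed rows, then read off the flagged column names per row.
changed = mask[mask.any(axis=1)]
columns = changed.columns.to_numpy()
differences = [[unique_id, columns[row].tolist()]
               for unique_id, row in zip(changed.index.tolist(), changed.to_numpy())]
print(differences)
```

The remaining list comprehension touches only the changed rows and does no per-row filtering of the frames, so it should cost far less than the nested loops in the question. In pandas 1.1 and later, `DataFrame.compare` may be another option when a side-by-side report of old and new values is enough.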

Regarding "python - Pandas equality check is too slow to be usable", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46717880/
