
python - Finding duplicates that match specific criteria in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 12:03:09

Below is the sample data I am working with.

sender  receiver  date      id
salman  akhtar    20161201  1111
akhtar  salman    20161201  1112
nabeel  ahmed     20161201  1113
salman  akhtar    20161201  1114
salman  akhtar    20161202  1115
nabeel  ahmed     20161202  1116
ahmed   nabeel    20161202  1117
nabeel  ahmed     20161202  1118
nabeel  ahmed     20161202  1119

What I want to achieve is to find duplicate entries based on one condition: the same sender and the same receiver on the same date.

To do that, I wrote the following code.

import pandas as pd

print('Script for Finding duplicate entries\n')

path = input('Enter file name: ')
print('Loading file. Please wait...')

# Open the workbook once so that a sheet can be picked afterwards
xlsx = pd.ExcelFile(path + '.xlsx')

print('File loaded successfully.\n')
sheet = input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet_name=sheet)

# keep=False flags every row of a duplicate group, not only the repeats
df['is_duplicated'] = df.duplicated(['sender', 'receiver', 'date'], keep=False)

df_dup = df[df['is_duplicated']]

print('Found Below Duplicates')
print(df_dup)

# engine='xlsxwriter' needs the xlsxwriter package installed;
# the with-block saves and closes the workbook on exit
with pd.ExcelWriter('pandas_column_formats.xlsx', engine='xlsxwriter') as writer:
    df_dup.to_excel(writer, sheet_name='Sheet1')

print('File created successfully.')
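
For reference, the exact-match step can be tried on the sample rows without an Excel file. The following is a minimal sketch that builds the table in memory; the names `rows` and `dup_ids` are only illustrative and do not appear in the script above.

```python
import pandas as pd

# Sample rows from the question: (sender, receiver, date, id)
rows = [
    ('salman', 'akhtar', 20161201, 1111),
    ('akhtar', 'salman', 20161201, 1112),
    ('nabeel', 'ahmed', 20161201, 1113),
    ('salman', 'akhtar', 20161201, 1114),
    ('salman', 'akhtar', 20161202, 1115),
    ('nabeel', 'ahmed', 20161202, 1116),
    ('ahmed', 'nabeel', 20161202, 1117),
    ('nabeel', 'ahmed', 20161202, 1118),
    ('nabeel', 'ahmed', 20161202, 1119),
]
df = pd.DataFrame(rows, columns=['sender', 'receiver', 'date', 'id'])

# keep=False marks all members of each (sender, receiver, date) group
is_dup = df.duplicated(['sender', 'receiver', 'date'], keep=False)
dup_ids = df.loc[is_dup, 'id'].tolist()

print(dup_ids)  # [1111, 1114, 1116, 1118, 1119]
```

Rows 1112 and 1117 are not flagged because sender and receiver are swapped, and a misspelled name would not be flagged either, which is the gap that fuzzy matching is meant to close.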

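Now I would also like to incorporate fuzzywuzzy, because the current code only returns exactly duplicated rows, and I want all possible duplicate rows based on the condition above.

Can anyone help?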

Best answer

Something like this?

>>> from fuzzywuzzy import fuzz
>>> fuzz_ratio = 50
>>> df_rem = df[~df.duplicated(['sender', 'receiver', 'date'], keep=False)]
>>> df_possible_dup = pd.merge(df_rem, df, on='date', suffixes=('', '_j'))
>>> df_possible_dup.apply(lambda x: fuzz.ratio(x['sender'], x['sender_j']) >= fuzz_ratio and x['id'] != x['id_j'], axis=1)

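I don't know your exact requirements, but you probably want to check whether the sender or the receiver matches exactly while the other field is a possible (fuzzy) match. For that you can use a custom function: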

def worker(x, fuzz_ratio):
    # A row paired with itself is not a duplicate
    if x['id'] == x['id_j']:
        return False

    # Same sender, similar receiver
    if x['sender'] == x['sender_j'] and fuzz.ratio(x['receiver'], x['receiver_j']) > fuzz_ratio:
        return True

    # Same receiver, similar sender
    if x['receiver'] == x['receiver_j'] and fuzz.ratio(x['sender'], x['sender_j']) > fuzz_ratio:
        return True

    return False

>>> df_possible_dup.apply(lambda x: worker(x, fuzz_ratio), axis=1)
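
To see the whole approach run end to end, here is a minimal self-contained sketch. It makes several assumptions that are not in the answer: the four rows are made-up data with one deliberate misspelling, the threshold of 80 is arbitrary, and `similarity` is a hypothetical stand-in built on `difflib.SequenceMatcher` from the standard library so that the example runs without fuzzywuzzy installed. Its scores may differ from `fuzz.ratio`, so swap `fuzz.ratio` back in for real use.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 similarity score."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def is_possible_dup(row, threshold):
    # A row paired with itself is not a duplicate
    if row['id'] == row['id_j']:
        return False
    # Same sender, similar receiver
    if row['sender'] == row['sender_j'] and similarity(row['receiver'], row['receiver_j']) > threshold:
        return True
    # Same receiver, similar sender
    if row['receiver'] == row['receiver_j'] and similarity(row['sender'], row['sender_j']) > threshold:
        return True
    return False


# Made-up rows: ids 1 and 2 differ only by a misspelled sender
df = pd.DataFrame({
    'sender': ['salman', 'salmaan', 'nabeel', 'akhtar'],
    'receiver': ['akhtar', 'akhtar', 'ahmed', 'salman'],
    'date': [20161201, 20161201, 20161201, 20161201],
    'id': [1, 2, 3, 4],
})

# Drop exact duplicates first, then pair every remaining row with
# every row that shares its date (a self-join on 'date')
df_rem = df[~df.duplicated(['sender', 'receiver', 'date'], keep=False)]
pairs = pd.merge(df_rem, df, on='date', suffixes=('', '_j'))

mask = pairs.apply(lambda row: is_possible_dup(row, 80), axis=1)
possible = pairs[mask]
matched_pairs = sorted(zip(possible['id'].tolist(), possible['id_j'].tolist()))

print(matched_pairs)  # [(1, 2), (2, 1)]
```

Each match shows up twice, once in each direction, because the join pairs every row with every other row on the same date. That also means the number of comparisons grows quadratically with the rows per date, which may matter on large sheets.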

Regarding "python - Finding duplicates that match specific criteria in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41468674/
