gpt4 book ai didi

python - 如何使用 Pandas 对 excel 文件进行模糊匹配?

转载 作者:太空宇宙 更新时间:2023-11-04 05:34:44 26 4
gpt4 key购买 nike

我有一个名为 account 的表,其中包含两列 - ID 和 NAME。 ID 是一个唯一的散列,而 NAME 是一个可能有重复的字符串。

我正在尝试编写一个 python 脚本来读取这个 excel 文件并匹配 0-3 个相似的 NAME 值,但我似乎无法让它工作。有人可以帮忙吗?谢谢

import pandas as pd
from fuzzywuzzy import fuzz
import difflib

def get_spr(row):
d = name1.apply(lambda x: (fuzz.ratio(x['NAME'], row['NAME']) * 0 if row['ID'] == x['ID'] else 1), axis=1)
d = d[d>= 60]
if len(d) == 0:
v = ['']*2
else:
v = name1.ix[d.idxmax(),['ID' , 'NAME']].values
return pd.Series(v, index=['ID', 'NAME'])


def score(tablerow):
d = name1.apply(lambda x: fuzz.ratio(x['NAME'],tablerow['NAME']) * (0 if x['ID']==tablerow['ID'] else 1), axis=1)
d = d[d>90]
if len(d) == 0:
v = [''] * 2
else:
v = name1.ix[d.order(ascending=False).head(3).index, ['ID' , 'NAME']].values
return pd.DataFrame(v, index=['ID', 'NAME'])

account = "account_test.xlsx"

xl_acc1 = pd.ExcelFile(account)
xl_acc2 = pd.ExcelFile(account)

acc1 = xl_acc1.parse(xl_acc1.sheet_names[0])
acc2 = xl_acc2.parse(xl_acc2.sheet_names[0])

name1 = acc1[pd.notnull(acc1['NAME'])]
name2 = acc2[pd.notnull(acc2['NAME'])]
print 'Doing Fuzzy Matching'


name2= pd.concat((name2,name2.apply(get_spr, axis=1)), axis=1)
name2.to_excel(pd.ExcelWriter('res.xlsx'),'acc')

如有任何帮助,我们将不胜感激!

文件有这样的行:-

ID                    NAME
0016F00001c7GDZQA2 Daniela Abriani
0016F00001c7GPnQAM Daniel Abriani
0016F00001c7JRrQAM Nisha Well
0016F00001c7Jv8QAE Katherine
0016F00001c7cXiQAI Katerine
0016F00001c7dA3QAI Katherin
0016F00001c7kHyQAI Nursing and Midwifery Council Research Office
0016F00001c8G8OQAU Nisa Well

预期(输出数据帧)会是这样的:

      ID               NAME          ID2          NAME2
<hash1> katherine <hash2> katerine
<hash1> katherine <hash3> katherin
<hash4> Nisa Well <hash5> Nisha Well

问题:上面的代码只是将输入重现为输出保存文件,而没有实际连接任何匹配项。

最佳答案

我认为您不需要在 pandas 中执行此操作。这是我草率的解决方案,但它使用字典获得了您想要的输出。

from fuzzywuzzy import process
df = pd.DataFrame([
['0016F00001c7GDZQA2', 'Daniela Abriani'],
['0016F00001c7GPnQAM', 'Daniel Abriani'],
['0016F00001c7JRrQAM', 'Nisha Well'],
['0016F00001c7Jv8QAE', 'Katherine'],
['0016F00001c7cXiQAI', 'Katerine'],
['0016F00001c7dA3QAI', 'Katherin'],
['0016F00001c7kHyQAI', 'Nursing and Midwifery Council Research Office'],
['0016F00001c8G8OQAU', 'Nisa Well']],
columns=['ID', 'NAME'])

将唯一的哈希值放入字典。

hashdict = dict(zip(df['ID'], df['NAME']))

定义一个函数checkpair。您将需要它来删除互惠哈希对。此方法将添加 (hash1, hash2)(hash2, hash1),但我认为您只想保留其中一对:

def checkpair (a,b,l):
for x in l:
if (a,b) == (x[2],x[0]):
l.remove(x)

现在遍历 hashdict.items() 找到前 3 个匹配项。模糊的docs详细说明 process 方法。

matches = []
for k,v in hashdict.items():

#see docs for extract -- 4 because you are comparing a name to itself
top3 = process.extract(v, hashdict, limit=4)

#remove the hashID compared to itself
for h in top3:
if k == h[2]:
top3.remove(h)

#append tuples to the list "matches" if it meets a score criteria
[matches.append((k, v, x[2], x[0], x[1])) for x in top3 if x[1] > 60] #change score?

#remove reciprocal pairs
[checkpair(m[0], m[2], matches) for m in matches]

df = pd.DataFrame(matches, columns=['id1', 'name1', 'id2', 'name2', 'score'])
# write to file
writer = pd.ExcelWriter('/path/to/your/file.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()

输出:

    id1     name1   id2     name2   score
0 0016F00001c7JRrQAM Nisha Well 0016F00001c8G8OQAU Nisa Well 95
1 0016F00001c7GPnQAM Daniel Abriani 0016F00001c7GDZQA2 Daniela Abriani 97
2 0016F00001c7Jv8QAE Katherine 0016F00001c7dA3QAI Katherin 94
3 0016F00001c7Jv8QAE Katherine 0016F00001c7cXiQAI Katerine 94
4 0016F00001c7dA3QAI Katherin 0016F00001c7cXiQAI Katerine 88

关于python - 如何使用 Pandas 对 excel 文件进行模糊匹配?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35982489/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com