
Python fuzzy matching (FuzzyWuzzy): keep only the best match

Reposted. Author: 太空狗. Updated: 2023-10-29 21:22:03

I am trying to fuzzy match two CSV files. Each file contains one column of names that are similar but not identical.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):

    with open(path,'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row


if __name__ == "__main__":
    ## Create lookup dictionary by parsing the products csv
    data = {}
    for row in parse_csv('names_1.csv'):
        data[row[0]] = row[0]

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("names_2.csv"):
        #print(process.extract(row,data, limit = 100))
        for found, score, matchrow in process.extract(row, data, limit=100):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [row, score, found]
                writer.writerow(Digi_Results)


save_file.close()

The output looks like this:

Name11 , 90 , Name25 
Name11 , 85 , Name24
Name11 , 65 , Name29

The script runs fine and the output is as expected. But what I am looking for is only the best match for each name:

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

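So I need some way to remove the duplicate names in column 1, keeping the row with the maximum value in column 2. It should be fairly simple, but I can't seem to figure it out. Any help would be appreciated.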
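
The deduplication described above (for each name in column 1, keep only the row with the highest score in column 2) could be sketched as a post-processing step over the collected rows. This is only a minimal illustration: the helper name `keep_best_matches` is hypothetical, and the sample rows reuse values from the output above plus one made-up row (`Name12, 70, Name23`) to show a lower-scoring duplicate being dropped.

```python
def keep_best_matches(rows):
    """Return one (name, score, match) row per name: the highest-scoring one.

    `rows` is any iterable of (name, score, match) tuples. This helper is
    illustrative and not part of fuzzywuzzy.
    """
    best = {}
    for name, score, match in rows:
        # Replace the stored row only when this one scores strictly higher,
        # so the first row seen wins on ties.
        if name not in best or score > best[name][1]:
            best[name] = (name, score, match)
    # Dicts preserve insertion order (Python 3.7+), so names come out in
    # the order they were first seen.
    return list(best.values())


rows = [
    ("Name11", 90, "Name25"),
    ("Name11", 85, "Name24"),
    ("Name11", 65, "Name29"),
    ("Name12", 70, "Name23"),  # made-up row for illustration
    ("Name12", 95, "Name21"),
]
print(keep_best_matches(rows))
```

This approach does not depend on the order in which the matcher returns its results, at the cost of holding one row per name in memory.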

Best answer

fuzzywuzzy's process.extract() returns its list sorted by score in descending order, with the best match first.

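So to find the best match, you can set the limit argument to 1 so that only the best match is returned. If its score is at least 60, you can write it to the CSV just as you do now.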

Example:

from fuzzywuzzy import process
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)
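
As a possible variation, fuzzywuzzy also provides process.extractOne(), which returns only the single best match and accepts a score_cutoff argument, returning None when nothing reaches the cutoff. The sketch below assumes that behaviour. The helper `best_matches` and its `extract_one` parameter are hypothetical names; the matcher is passed in so the logic can be exercised without fuzzywuzzy installed.

```python
def best_matches(queries, choices, extract_one, threshold=60):
    """Return [query, score, match] for each query whose best match scores >= threshold.

    `extract_one` is assumed to behave like fuzzywuzzy's process.extractOne:
    called as extract_one(query, choices, score_cutoff=...), it returns None
    or a tuple whose first two items are (match, score).
    """
    results = []
    for query in queries:
        hit = extract_one(query, choices, score_cutoff=threshold)
        if hit is None:  # no candidate reached the threshold
            continue
        match, score = hit[0], hit[1]
        results.append([query, score, match])
    return results


# Possible wiring (needs the third-party fuzzywuzzy package; names_1 and
# names_2 are hypothetical lists of strings read from the two CSV files):
#     from fuzzywuzzy import process
#     rows = best_matches(names_2, names_1, process.extractOne)
```

Each returned row has the same [name, score, match] layout as Digi_Results, so it can be passed to writer.writerow() unchanged.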

Regarding Python fuzzy matching (FuzzyWuzzy) and keeping only the best match, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32055817/
