
Python fuzzy matching of strings in lists - performance


I am checking 4 identical dataframe columns for similar results (fuzzy matching), and I have the following code as an example. When I apply it to the real 40,000-row x 4-column dataset, it just keeps running forever. The problem is that the code is too slow. For example, if I limit the dataset to 10 users, the computation takes 8 minutes, and with 20 users it takes 19 minutes. Am I missing something? I don't know why it takes this long. I'd like to get all the results in 2 hours or less. Any tips or help would be greatly appreciated.

from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result)

Output: [('asple', 80), ('tab', 80)]

Best Answer

Major speed improvements come from writing vectorized operations and avoiding loops.

Import the necessary packages

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Create a dataframe from the first list

dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']

Create a dataframe from the second list

compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']

Merge - Cartesian product by introducing a dummy key (self join)

dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
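
As a side note (not part of the original answer), on pandas 1.2 or newer the same Cartesian product can be built with a cross merge, so the dummy Key column is not needed; a minimal equivalent sketch:

# Cross merge pairs every row of the first frame with every row of the
# second one; identical strings are then dropped as before.
combined_dataframe = dataframecolumn[['Match']].merge(compare[['compare']], how="cross")
combined_dataframe = combined_dataframe[combined_dataframe.Match != combined_dataframe.compare]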

Vectorize the matching function

def partial_match(x, y):
    return fuzz.ratio(x, y)
partial_match_vector = np.vectorize(partial_match)
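
A quick sanity check of the vectorized scorer on plain lists (illustrative only; the scores agree with the result table below):

# fuzz.ratio is applied element-wise across the two sequences.
print(partial_match_vector(["apple", "tb"], ["asple", "tab"]))
# expected output: [80 80]

Note that np.vectorize is documented as a convenience wrapper around a Python-level loop, so it is best treated as a readability aid rather than true NumPy vectorization.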

Apply the vectorized function and get the desired result by setting a threshold on the score

combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]

Result

+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple | 1   | asple   | 80    |
| tb    | 1   | tab     | 80    |
+-------+-----+---------+-------+
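
Putting the steps of the answer together, a minimal end-to-end sketch for two columns of a larger dataframe could look like the following; the function name fuzzy_pairs, the column names, and the threshold of 80 are illustrative assumptions, not part of the original question or answer:

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

def fuzzy_pairs(df, left_col, right_col, threshold=80):
    # Cartesian product of the two columns via a dummy key (as in the answer),
    # scored pairwise with fuzz.ratio, keeping only pairs above the threshold.
    left = df[[left_col]].drop_duplicates().assign(Key=1)
    right = df[[right_col]].drop_duplicates().assign(Key=1)
    pairs = left.merge(right, on="Key", how="left").drop(columns="Key")
    pairs = pairs[pairs[left_col] != pairs[right_col]].copy()
    scorer = np.vectorize(fuzz.ratio)
    pairs["score"] = scorer(pairs[left_col], pairs[right_col])
    return pairs[pairs["score"] >= threshold]

# Hypothetical usage on a 40,000-row dataframe with columns "name_a" and "name_b":
# matches = fuzzy_pairs(users, "name_a", "name_b", threshold=80)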

Regarding Python fuzzy matching of strings in lists and its performance, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56040817/
