python - Multiprocessing fuzzy wuzzy string search

Reposted by: 太空宇宙. Updated: 2023-11-04 05:05:18

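I am trying to do string matching in Python with fuzzywuzzy and bring back the matching ID. My datasets are large: dataset 1 has 1.8 million records and dataset 2 has 1.6 million records.

What I have tried so far:

First I tried the record linkage package in Python. Unfortunately it ran out of memory while building the multi index, so I moved to a more powerful AWS machine and built it successfully. However, the comparison step then ran forever, which I accept is due to the number of comparisons.

Then I tried string matching with fuzzywuzzy, parallelizing the process with the dask package, and ran it on sample data. It works fine, but I know the process will still take a long time because the search space is so large. I am looking for a way to add blocking or indexing to this code.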

import pandas as pd

test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})

Here I am trying to look up test.Address1 in test2.Address1 and bring back its ID.

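from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # score every candidate address against the query string
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score
    # (.loc replaces .ix, which has been removed from pandas)
    return slave_df.loc[slave_df.score.idxmax(), 'ID']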

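import dask.dataframe as dd

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
# originally compute(get=dask.multiprocessing.get); newer dask releases take scheduler= instead
dmaster.compute(scheduler='processes')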

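This works fine, but I am not sure how to apply indexing to it by restricting the search space to the same city.

Say I create an index on the city field, subset on the city of the original string, and pass that subset to the helper function: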

# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)

I don't know how to do this. Please advise. Thanks in advance.

Best answer

I prefer fuzzywuzzy.process.extractOne, which compares one string against an iterable of strings.

from fuzzywuzzy import process

def extract_one(col, other):
    # need this for dask later
    other = other.compute() if hasattr(other, 'compute') else other
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)

extract_one(test.Address1, test2.Address1)

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

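idx is the index of the entry in other (the second argument passed to extract_one) that matches best. I would recommend having a meaningful index, so that joining the results later is easier.

For your second question, about filtering by city, I would use groupby and apply: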
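Since the question asks for the matching ID rather than the row position, the idx column still has to be translated back into an ID. The following is a minimal sketch of one way to do that, assuming test2 keeps its default integer index; the matches frame is a hard-coded copy of the extract_one output shown above, so the snippet runs without fuzzywuzzy installed.

```python
import pandas as pd

# Second dataset from the question; its default RangeIndex (0..3) is what
# extract_one reports in the 'idx' column.
test2 = pd.DataFrame({
    'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'],
    'city': ['X', 'U', 'Z', 'Y'],
    'ID': ['1', '3', '4', '8'],
})

# Hard-coded copy of the extract_one result (an assumption made so that the
# example does not depend on fuzzywuzzy being available).
matches = pd.DataFrame({
    'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'],
    'score': [92, 83, 86, 95],
    'idx': [0, 1, 2, 3],
})

# Series.map with a Series argument looks each idx value up in test2's index
# and returns the corresponding ID.
matches['ID'] = matches['idx'].map(test2['ID'])
print(matches)
```

If test2 were indexed by ID in the first place (test2.set_index('ID')), idx would already hold the ID and this lookup step would not be needed.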

gr1 = test.groupby('city')
gr2 = test2.groupby("city")

gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3
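The groupby approach above is a form of blocking: each address is compared only against candidates from the same city. The sketch below illustrates the same idea with the standard library alone. The names similarity and best_match_by_city are hypothetical, and difflib.SequenceMatcher is used as a stand-in scorer; it is not fuzzywuzzy's token_set_ratio and will give different scores.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (NOT fuzzywuzzy's token_set_ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match_by_city(left, right):
    """Match each (address, city) in left against same-city rows of right.

    right holds (address, city, id) tuples. Returns a list of
    (address, matched_id_or_None, score) tuples, one per row of left.
    """
    # Build one block of candidates per city.
    blocks = defaultdict(list)
    for address, city, rec_id in right:
        blocks[city].append((address, rec_id))

    results = []
    for address, city in left:
        candidates = blocks.get(city, [])
        if not candidates:
            # No record in this city: nothing to compare against.
            results.append((address, None, 0))
            continue
        cand_addr, cand_id = max(candidates, key=lambda c: similarity(address, c[0]))
        results.append((address, cand_id, similarity(address, cand_addr)))
    return results


left = [('123 Cheese Way', 'X'), ('234 Cookie Place', 'U'),
        ('345 Pizza Drive', 'X'), ('456 Pretzel Junction', 'U')]
right = [('123 chese wy', 'X', '1'), ('234 kookie Pl', 'U', '3'),
         ('345 Pizzza DR', 'Z', '4'), ('456 Pretzel Junktion', 'Y', '8')]

results = best_match_by_city(left, right)
for row in results:
    print(row)
```

On the question's sample data each city block holds a single candidate, so '345 Pizza Drive' (city X) is paired with ID 1 even though its closest address is filed under city Z. A minimum-score cutoff would be one way to reject such pairs.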

The only difference with dask is that you need to specify a meta for the apply:

ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)

dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')

meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()

             Address1  score  idx
city                             
U    0  234 kookie Pl     83    1
     1  234 kookie Pl     28    1
X    0   123 chese wy     92    0
     1   123 chese wy     28    0

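Here is a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad

I'm curious to hear how the performance is. I assume the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love feedback on how much overhead is spent in pandas and dask. Make sure you have the C extension for computing Levenshtein distance.

Regarding python - Multiprocessing fuzzy wuzzy string search, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44666325/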
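One quick way to check for the C extension is to test whether the Levenshtein module (installed by the python-Levenshtein package) can be found. This is a small sketch using only the standard library; has_module is a hypothetical helper name, and the check only reports whether the module is importable, not how fast it is.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be imported here."""
    return importlib.util.find_spec(name) is not None


if has_module("Levenshtein"):
    print("Levenshtein module found: fuzzywuzzy can use the C implementation")
else:
    print("Levenshtein module not found: fuzzywuzzy falls back to the slower difflib matcher")
```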
