gpt4 book ai didi

python - Python 的 difflib 中的 SequenceMatcher 是否有可能提供一种更有效的方法来计算 Levenshtein 距离?

转载 作者:太空狗 更新时间:2023-10-29 21:55:33 25 4
gpt4 key购买 nike

这是计算 Levenshtein 距离的一般算法的教科书示例(我从 Magnus Hetland's webite 中提取):

def levenshtein(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n

current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)

return current[n]

不过,我想知道是否有使用 difflib 的 SequenceManager 的更高效(并且可能更优雅)的纯 Python 实现。在玩弄它之后,这就是我想出的:

from difflib import SequenceMatcher as sm

def lev_using_difflib(s1, s2):
a = b = size = distance = 0
for m in sm(a=s1, b=s2).get_matching_blocks():
distance += max(m.a-a, m.b-b) - size
a, b, size = m
return distance

我想不出失败的测试用例,性能似乎明显优于标准算法。

下面是依赖 difflib 的 levenshtein 算法的结果:

>>> from timeit import Timer
>>> setup = """
... from difflib import SequenceMatcher as sm
...
... def lev_using_difflib(s1, s2):
... a = b = size = distance = 0
... for m in sm(a=s1, b=s2).get_matching_blocks():
... distance += max(m.a-a, m.b-b) - size
... a, b, size = m
... return distance
...
... strings = [('sunday','saturday'),
... ('fitting','babysitting'),
... ('rosettacode','raisethysword')]
... """
>>> stmt = """
... for s in strings:
... lev_using_difflib(*s)
... """
>>> Timer(stmt, setup).timeit(100000)
36.989389181137085

这是标准的纯 Python 实现:

>>> from timeit import Timer
>>> setup2 = """
... def levenshtein(a,b):
... n, m = len(a), len(b)
... if n > m:
... a,b = b,a
... n,m = m,n
...
... current = range(n+1)
... for i in range(1,m+1):
... previous, current = current, [i]+[0]*n
... for j in range(1,n+1):
... add, delete = previous[j]+1, current[j-1]+1
... change = previous[j-1]
... if a[j-1] != b[i-1]:
... change = change + 1
... current[j] = min(add, delete, change)
...
... return current[n]
...
... strings = [('sunday','saturday'),
... ('fitting','babysitting'),
... ('rosettacode','raisethysword')]
... """
>>> stmt2 = """
... for s in strings:
... levenshtein(*s)
... """
>>> Timer(stmt2, setup2).timeit(100000)
55.594768047332764

使用difflib的SequenceMatcher的算法性能真的更好吗?还是它依赖于使比较完全无效的 C 库?如果它依赖于 C 扩展,我如何通过查看 difflib.py 实现来判断?

使用 Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)]

预先感谢您的帮助!

最佳答案

>>> levenshtein('hello', 'world')
4
>>> lev_using_difflib('hello', 'world')
5

关于python - Python 的 difflib 中的 SequenceMatcher 是否有可能提供一种更有效的方法来计算 Levenshtein 距离?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12659199/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com