
python - An efficient way to find the longest duplicated string in Python (from Programming Pearls)

Reposted · Author: 太空狗 · Updated: 2023-10-29 18:19:25

From Section 15.2 of Programming Pearls.

The C code is available here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

When I implemented it in Python using a suffix array:

example = open("iliad10.txt").read()

def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp=lambda a, b: cmp(example[a:], example[b:]))  # VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i + 1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
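The comlen helper above walks two strings in lockstep and counts matching leading characters; the same common-prefix length can be computed more compactly with itertools.takewhile. This is a minimal, equivalent sketch, not part of the original question:

```python
from itertools import takewhile

def comlen(p, q):
    """Length of the common prefix of p and q."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(p, q)))
```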

I found the idx.sort step to be very slow. I think it is slow because Python has to copy each substring by value instead of comparing through pointers (as the C code above does): every comparison slices out two full suffixes, so the sort costs O(n² log n) character operations in the worst case.
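The cost can be seen directly: full-suffix keys materialize a copy of up to n characters for each of the n indices, while bounding the compared prefix (the idea behind the accepted answer's pre-sorting step) keeps every key short. An illustrative sketch only; the text and step value are arbitrary choices:

```python
text = "banana" * 50

# Full-suffix keys: building the n keys alone is O(n^2) in time and memory.
idx_exact = sorted(range(len(text)), key=lambda i: text[i:])

# Bounded-prefix keys: O(n * step) total, but the resulting order is only
# guaranteed up to the first `step` characters of each suffix.
step = 4
idx_approx = sorted(range(len(text)), key=lambda i: text[i:i + step])
```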

The test file can be downloaded from here.

The C code finishes in only about 0.3 seconds:

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.

real 0m0.328s
user 0m0.291s
sys 0m0.006s

But the Python code never finishes on my computer (I killed it after waiting 10 minutes).

Does anyone know how to make this code efficient (e.g., finish in under 10 seconds)?

Best Answer

My solution is based on a suffix array, constructed by prefix doubling of the longest common prefix. The worst-case complexity is O(n (log n)²). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can easily be modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicated strings are longer than 10000 characters.

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if substring not in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are pre-sorted first. The results are
    then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value: (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                 for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array         for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix        for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
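For small inputs, the result above can be cross-checked against a naive brute force. This is a self-contained sketch, quadratic or worse, for sanity testing only and not part of the answer's method:

```python
def longest_repeated_naive(text):
    """Brute-force longest repeated substring; fine for tiny inputs only."""
    n = len(text)
    best = ""
    for i in range(n):
        for j in range(i + 1, n):
            # Extend the match starting at positions i and j as far as it goes.
            k = 0
            while j + k < n and text[i + k] == text[j + k]:
                k += 1
            if k > len(best):
                best = text[i:i + k]
    return best
```

For 'banana' this returns 'ana', matching the doctest result {'ana': [1, 3]} above.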

I prefer this solution over the more complicated O(n log n) one because Python has a very fast list-sorting algorithm (Timsort). Python's sort is probably faster than the strictly linear-time operations required by the method in that article, which should be O(n) only under very special assumptions: random strings over a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot make good use of the CPU memory cache.

The code in another answer, based on grow_chains, is 19 times slower than the original example from the question if the text contains an 8 kB repeated string. Long repeated texts are not typical of classical literature, but they are frequent, e.g., in "independent" school homework collections. The program should not freeze on such input.
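A pathological input of the kind described, a text containing a long repeated block, is easy to generate for benchmarking. A sketch; the ~8 kB size comes from the paragraph above, while the alphabet and filler length are arbitrary choices:

```python
import random

random.seed(0)  # reproducible test data
alphabet = 'abcdefgh'
block = ''.join(random.choice(alphabet) for _ in range(8192))   # ~8 kB repeat
filler = ''.join(random.choice(alphabet) for _ in range(1024))
text = block + filler + block   # the block occurs at least twice
```

Quadratic approaches slow down dramatically on such a text, while the suffix-array solution above handles it without trouble.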

I wrote an example and tests with the same code for Python 2.7 and 3.3-3.6.

About "python - An efficient way to find the longest duplicated string in Python (from Programming Pearls)": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13560037/
