gpt4 book ai didi

python - 高效地生成一个字符串所有可能子串的列表

转载 作者:太空宇宙 更新时间:2023-11-04 01:45:21 25 4
gpt4 key购买 nike

我想编写一个函数,根据子字符串的最小和最大长度有效地返回一个字符串的所有可能子字符串的列表。 (字符串仅包含大写字母。)

例如,对于字符串'THISISASTRING',对于min_length=3max_length=4,它应该返回:

['THI', 'THIS', 'HIS', 'HISI', 'ISI', 'ISIS', 'SIS', 'SISA', 'ISA',
'ISAS', 'SAS', 'SAST', 'AST', 'ASTR', 'STR', 'STRI', 'TRI', 'TRIN',
'RIN', 'RING', 'ING']

我正在寻找一种比我目前的解决方案快得多的解决方案:

import cProfile

random_english_text = \
'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

def assemble_substrings(textstring, length_min, length_max):
str_len = len(textstring)
subStringList = []
idx = 0
while idx <= str_len - length_min:
max_depth = min(length_max, str_len - idx)
for i in list(range(length_min, max_depth + 1)):
subString = textstring[idx:idx + i]
subStringList.append(subString)
idx += 1
return subStringList


pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
list_of_substrings = assemble_substrings(textstring=random_english_text, length_min=4, length_max=10)

pr.disable()
pr.print_stats(sort='cumtime')

这会产生我:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1000 1.332 0.001 1.672 0.002 <input>:11(assemble_substrings)
3654000 0.227 0.000 0.227 0.000 {method 'append' of 'list' objects}
525000 0.112 0.000 0.112 0.000 {built-in method builtins.min}
1000 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

现在,从分析器的输出中,我对如何加速此功能没有太多了解。

使此功能尽可能快的最佳方法是什么?我应该使用不同于列表的数据结构吗?使用赛通?或者在外部 C/C++ 共享对象中编写此代码?

将非常感谢您的投入,一般来说也包括如何在 Python 中有效地处理字符串和类似于上述字符串的操作。

最佳答案

为什么不简单地使用 2 个范围和字符串切片的列表理解?

t = "SOMETEXT"

print(t)

minl = 3
maxl = 8

parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]

print(parts)

输出:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
'OMETEXT', 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'METEXT', 'METEXT', 'ETE', 'ETEX',
'ETEXT', 'ETEXT', 'ETEXT', 'ETEXT', 'TEX', 'TEXT', 'TEXT', 'TEXT', 'TEXT', 'TEXT']

如果顺序不重要,您可以使用集合来删除重复项 - 否则为有序存储创建一个唯一列表:

nodupes = [] 
k = set()
for l in parts:
if l in k:
pass
else:
nodupes.append(l)
k.add(l)

print(nodupes)

输出:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'ETE', 'ETEX', 'ETEXT', 'TEX', 'TEXT']

随着时间的推移:

def doit(t,minl,maxl):
parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]
return parts

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
list_of_substrings = doit(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

         3001 function calls in 0.597 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
1000 0.001 0.000 0.597 0.001 main.py:10(doit)
1000 0.596 0.001 0.596 0.001 main.py:11(<listcomp>)
1000 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

您的给出:在 1.614 秒内调用了 4181001 次函数

关于python - 高效地生成一个字符串所有可能子串的列表,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54854762/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com