
python - Speed up millions of regex replacements in Python 3

Reprinted · Author: IT老高 · Updated: 2023-10-28 21:05:51

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So, I have to loop through 750K sentences and perform about 20K replacements, but only if my words are actually "words" and are not part of a larger string.

I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences":

import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list

This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
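One common first step (a sketch with stand-in data, not necessarily the fastest option) is to compile all the words into a single alternation, so each sentence is scanned once rather than 20K times:

```python
import re

# Stand-ins for the real 20K-word and 750K-sentence lists.
my20000words = ["hello", "world"]
sentences = ["hello there world", "helloworld stays intact"]

# Sort longest-first so a longer word wins when another word is its prefix.
pattern = re.compile(
    r"\b(?:"
    + "|".join(map(re.escape, sorted(my20000words, key=len, reverse=True)))
    + r")\b"
)

# One pass over each sentence removes every banned word at once;
# "helloworld" survives because \b requires a real word boundary.
cleaned = [pattern.sub("", s) for s in sentences]
print(cleaned)  # [' there ', 'helloworld stays intact']
```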

I'm using Python 3.5.2.

Best Answer

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
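The set-based idea boils down to tokenizing each sentence once and checking every token against a set; a minimal sketch (the `[a-zA-Z]+` word definition and the names are assumptions, not the linked code verbatim):

```python
import re

banned_words = {"hello", "world"}  # stand-in for the real 20K-word set

word_pattern = re.compile(r"[a-zA-Z]+")

def delete_banned(match):
    # O(1) set lookup per token instead of one regex pass per banned word.
    word = match.group(0)
    return "" if word.lower() in banned_words else word

cleaned = word_pattern.sub(delete_banned, "Hello there, world!")
print(cleaned)  # " there, !"
```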

Optimizing the regex with a trie

A simple regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and matching.
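Building such a trie out of nested dicts takes only a few lines; a minimal sketch (hypothetical `make_trie` helper, using a `'': 1` end-of-word marker like the dump shown below):

```python
def make_trie(words):
    """Build a nested-dict trie; '' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[''] = 1  # end-of-word marker
    return root

trie = make_trie(['foobar', 'foobah'])
print(trie)
# {'f': {'o': {'o': {'b': {'a': {'r': {'': 1}, 'h': {'': 1}}}}}}}
```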

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

[Image: Regex union]

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"
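As a sanity check, the collapsed trie pattern accepts exactly the same five words as a naive union of the example list (a standalone snippet, not from the original answer):

```python
import re

words = ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
trie_re = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")
union_re = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

# The two patterns agree on the words themselves and on near-misses.
for candidate in words + ['zoo', 'fooz', 'foobaz']:
    assert bool(trie_re.fullmatch(candidate)) == bool(union_re.fullmatch(candidate))
```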

[Image: Regex trie]

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's a preprocessing overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group.
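The difference is easy to see in isolation (a standalone illustration):

```python
import re

capturing = re.compile(r"foo(bar|baz)")
non_capturing = re.compile(r"foo(?:bar|baz)")

# Both accept the same strings...
assert capturing.fullmatch("foobar") and non_capturing.fullmatch("foobar")

# ...but only the capturing version records group data on every match.
assert capturing.fullmatch("foobar").groups() == ("bar",)
assert non_capturing.fullmatch("foobar").groups() == ()
```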

Code

Here's a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:  # recurse is None: this child is a leaf
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here's a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print(" %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms

TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms

TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms

TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms

TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms

For information, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti
...

It's really unreadable, but for a list of 100,000 banned words, this trie regex is 1000 times faster than a simple regex union!

Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:

[Image: full trie graph]

Regarding python - Speed up millions of regex replacements in Python 3, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42742810/
