gpt4 book ai didi

python - 如何有效地将字符串与一组通配符字符串进行匹配?

转载 作者:太空宇宙 更新时间:2023-11-03 14:30:05 25 4
gpt4 key购买 nike

我正在寻找一种解决方案来将单个字符串与一组通配符字符串进行匹配。例如

>>> match("ab", ["a*", "b*", "*", "c", "*b"])
["a*", "*", "*b"]

输出的顺序并不重要。

我将按照 10^4 个通配符字符串的顺序进行匹配,我将执行大约 ~10^9 个匹配调用。这意味着我可能不得不像这样重写我的代码:

>>> matcher = prepare(["a*", "b*", "*", "c", "*b"]
>>> for line in lines: yield matcher.match("ab")
["a*", "*", "*b"]

我已经开始用 Python 编写一个 trie 实现来处理通配符,我只需要正确处理这些极端情况。尽管如此,我还是很想听听; 你会如何解决这个问题?是否有任何 Python 库可以让我更快地解决这个问题?

到目前为止的一些见解:

  • 命名(Python,重新)正则表达式在这里对我没有帮助,因为它们只会返回一个匹配项。
  • pyparsing看起来像是一个很棒的库,但文档很少,而且在我看来,它不支持匹配多种模式。

最佳答案

您可以使用 re2 library 中的 FilteredRE2 类在 Aho-Corasick 算法实现(或类似算法)的帮助下。来自 re2 docs :

Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.

The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.

关于python - 如何有效地将字符串与一组通配符字符串进行匹配?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12904860/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com