gpt4 book ai didi

regex - 为什么这个递归正则表达式只在字符重复 2^n - 1 次时才匹配?

转载 作者:行者123 更新时间:2023-12-04 14:19:38 24 4
gpt4 key购买 nike

在阅读了 polygenelubricants 的关于高级正则表达式技术的系列文章(特别是 How does this Java regex detect palindromes? )后,我决定尝试创建自己的 PCRE 正则表达式来解析回文,使用递归(在 PHP 中)。

我想出的是:

^(([a-z])(?1)\2|[a-z]?)$

我对这个表达式的理解是它应该匹配零个或一个字符(每个少于 2 个字符的字符串隐含地是一个回文,以及在递归中解释具有奇数长度的回文),或者两个相同字符的分隔通过模式的递归。

不幸的是,它似乎并没有那样工作,正如您在 www.ideone.com/a9T3F 看到的那样。相反,只有 2n - 1 的字符串(即空字符串、 aaaaaaaaaaa 、 a15)与正则表达式匹配。

奇怪的是,如果我修改我的模式以便递归是可选的(即 ^(([a-z])(?1)?\2|[a-z]?)$ ,参见 www.ideone.com/D6lJR ,它只匹配具有重复 2n 次字符的字符串(即空字符串, aaa , 404,607914 , 404,607)。

为什么我的正则表达式没有按照我期望的方式工作?

急于建议不要使用正则表达式的人请注意: 这个问题的重点是学习如何正确使用递归正则表达式。我知道这不是确定字符串是否为回文的有效方法,如果由于某种原因不得不在生产代码中确定回文,我不会使用递归正则表达式;我只是有兴趣了解有关正则表达式高级方面的更多信息。

最佳答案

您观察到的现象是由于 PCRE 子模式递归是原子的,与 Perl 不同。 man page实际上非常详细地涵盖了这个问题:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):

    ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE it does not if the pattern is longer than three characters.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alternative is taken and the recursion kicks in. The recursive call to subpattern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what subpattern 2 matched, which was "a". This fails. Because the recursion is treated as an atomic group, there are now no backtracking points, and so the entire match fails. (Perl is able, at this point, to re- enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different:

    ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to recurse until it runs out of characters, at which point the recursion fails. But this time we do have another alternative to try at the higher level. That is the big difference: in the previous case the remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just those with an odd number of characters, it is tempting to change the pattern to this:

    ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason. When a deeper recursion has matched a single character, it cannot be entered again in order to match an empty string. The solution is to separate the two cases, and write out the odd and even cases as alternatives at the higher level:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

WARNING!!!

The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the entire string. For example, although "abcba" is correctly matched, if the subject is "ababa", PCRE finds the palindrome "aba" at the start, then fails at top level because the end of the string does not follow. Once again, it cannot jump back into the recursion to try other alternatives, so the entire match fails.



其他引用
  • regular-expressions.info/Atomic grouping
  • (?>…)在某些方面是原子分组语法
  • 环视 (?=…) , (?!…) , (?<=…) , (?<!…) , 都是原子的
  • Possessive量词(例如 a*+ )也是原子的
  • PCRE 递归子模式和子程序调用也是原子的


  • 仔细看看图案

    原子性论点是正确的,但它如何解释为什么模式表现得像观察到的那样可能并不明显。让我们仔细看看,看看这一切是如何契合的:

    我们将使用第一个模式:
    ^(([a-z])(?1)\2|[a-z]?)$

    我将使用以下符号来表示递归:
  • 1表示该角色在第一个备选
  • 中被捕获到第 2 组
  • 2表示该字符与第二个备选匹配
  • 或者如果 2不高于字符,? 的零重复选项被行使
  • \表示该字符与第一个备用
  • 中第 2 组的反向引用匹配
  • _表示递归分支的底部
  • 即使有其他选择,也不会重新进入此分支!

  • 现在让我们考虑 "aaa"作为输入:
          _
    1 1 1 2
    a a a # This is the first bottom of the recursion,
    # now we go back to the third 1 and try to match \.
    # This fails, so the third 1 becomes 2.
    _
    1 1 2
    a a a # Now we go back to the second 1 and try to match \.
    # This fails, so the second 1 becomes 2.
    _
    1 2
    a a a # The second level matched! now we go back to the first level...

    _____
    1 2 \
    a a a # Now the first 1 can match \, and entire pattern matches!!

    现在考虑 "aaaaa" :
              _
    1 1 1 1 1 2
    a a a a a # Fifth 1 can't match \, so it becomes 2.
    _
    1 1 1 1 2
    a a a a a # Fourth 1 can't match \, so it becomes 2.
    _____
    1 1 1 2 /
    a a a a a # Here's a crucial point. The third 1 successfully matched.
    # Now we're back to the second 1 and try to match \, but this fails.
    # However, since PCRE recursion is atomic, the third 1 will NOT be
    # reentered to try 2. Instead, we try 2 on the second 1.
    _____
    1 2 \
    a a a a a # Anchors don't match, so the first 1 becomes 2, and then also the
    # anchors don't match, so the pattern fails to match.

    请注意,一旦递归级别与第一个备选方案匹配,则将来不会尝试第二个备选方案(即使这样做可能导致可能匹配),因为 PCRE 子模式递归是原子的。

    现在考虑 "aa" :
        _
    1 1 2
    a a
    _
    1 2
    a a # The second level matched by taking the one repetition option on ?.
    # We now go back to the first level, and we can't match \.
    # Since PCRE recursion is atomic, we can't go back to the second level
    # to try the zero repetition option on ?.
    _
    2
    a a # Anchors don't match, trying zero option on ? also doesn't help,
    # so the pattern fails to match!

    请注意,一旦递归级别与 ? 的一次重复匹配在第二种选择中,将来不会尝试零重复选项(即使这样做可能导致可能匹配),因为 PCRE 子模式递归是原子的。

    现在让我们考虑 aaaaaaa
                  _
    1 1 1 1 1 1 1 2
    a a a a a a a
    _
    1 1 1 1 1 1 2
    a a a a a a a
    _____
    1 1 1 1 1 2 \
    a a a a a a a # A crucial point: the fifth level matched and now the fourth
    # level can't match \, but it does NOT reenter the fifth level to
    # try 2. Instead, the fourth level tries 2.
    _____
    1 1 1 2 \
    a a a a a a a
    _________
    1 1 1 2 \ \
    a a a a a a a
    _____________
    1 1 1 2 \ \ \
    a a a a a a a # Entire pattern is a match!

    请注意,即使 PCRE 子模式递归是原子的,它仍然可以成功匹配由重复 2n-1 次的字符组成的回文。

    现在,为了好玩,让我们试试 "abcba" :
              _
    1 1 1 1 1 2
    a b c b a
    _
    1 1 1 1 2
    a b c b a

    1 1 1 2
    a b c b a # Third level attempts \, but c does not match a!
    # So we go back to third 1 and try 2.
    _____
    1 1 2 \
    a b c b a
    _________
    1 1 2 \ \
    a b c b a # Entire pattern is a match!

    也就是说,模式不仅仅匹配“仅当一个字符重复 2n-1 次时”。确实可以匹配 "abcba" ( as seen on ideone.com )。然而,它不能匹配 "ababa" ,也不能匹配 "aaaaa" (请参阅手册页上的警告!),因为 PCRE 中的子模式递归是原子的。

    您可以应用相同的跟踪过程来解释模式在任何输入上的行为。

    关于regex - 为什么这个递归正则表达式只在字符重复 2^n - 1 次时才匹配?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3738631/

    24 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com