regex - 为什么这个递归正则表达式只在字符重复 2^n

regex - 为什么这个递归正则表达式只在字符重复 2^n - 1 次时才匹配？

转载作者：行者123 更新时间：2023-12-04 14:19:38

在阅读了 polygenelubricants 的关于高级正则表达式技术的系列文章(特别是 How does this Java regex detect palindromes? )后，我决定尝试创建自己的 PCRE 正则表达式来解析回文，使用递归(在 PHP 中)。

我想出的是:

^(([a-z])(?1)\2|[a-z]?)$

我对这个表达式的理解是它应该匹配零个或一个字符(每个少于 2 个字符的字符串隐含地是一个回文，以及在递归中解释具有奇数长度的回文)，或者两个相同字符的分隔通过模式的递归。

不幸的是，它似乎并没有那样工作，正如您在 www.ideone.com/a9T3F 看到的那样。相反，只有 2n - 1 的字符串(即空字符串、 a 、 aaa 、 aaaaaaa 、 a15)与正则表达式匹配。

奇怪的是，如果我修改我的模式以便递归是可选的(即 ^(([a-z])(?1)?\2|[a-z]?)$ ，参见 www.ideone.com/D6lJR ，它只匹配具有重复 2n 次字符的字符串(即空字符串， a ， aa , 404,607914 , 404,607)。

为什么我的正则表达式没有按照我期望的方式工作？

急于建议不要使用正则表达式的人请注意: 这个问题的重点是学习如何正确使用递归正则表达式。我知道这不是确定字符串是否为回文的有效方法，如果由于某种原因不得不在生产代码中确定回文，我不会使用递归正则表达式；我只是有兴趣了解有关正则表达式高级方面的更多信息。

最佳答案

您观察到的现象是由于 PCRE 子模式递归是原子的，与 Perl 不同。 man page实际上非常详细地涵盖了这个问题:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
    ^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE it does not if the pattern is longer than three characters.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alternative is taken and the recursion kicks in. The recursive call to subpattern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what subpattern 2 matched, which was "a". This fails. Because the recursion is treated as an atomic group, there are now no backtracking points, and so the entire match fails. (Perl is able, at this point, to re- enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different:
    ^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to recurse until it runs out of characters, at which point the recursion fails. But this time we do have another alternative to try at the higher level. That is the big difference: in the previous case the remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just those with an odd number of characters, it is tempting to change the pattern to this:
    ^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE, and for the same reason. When a deeper recursion has matched a single character, it cannot be entered again in order to match an empty string. The solution is to separate the two cases, and write out the odd and even cases as alternatives at the higher level:
    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
WARNING!!!

The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the entire string. For example, although "abcba" is correctly matched, if the subject is "ababa", PCRE finds the palindrome "aba" at the start, then fails at top level because the end of the string does not follow. Once again, it cannot jump back into the recursion to try other alternatives, so the entire match fails.

其他引用

regular-expressions.info/Atomic grouping

(?>…)在某些方面是原子分组语法

环视 (?=…) , (?!…) , (?<=…) , (?<!…) , 都是原子的

Possessive量词(例如 a*+ )也是原子的

PCRE 递归子模式和子程序调用也是原子的

仔细看看图案

原子性论点是正确的，但它如何解释为什么模式表现得像观察到的那样可能并不明显。让我们仔细看看，看看这一切是如何契合的:

我们将使用第一个模式:

^(([a-z])(?1)\2|[a-z]?)$

我将使用以下符号来表示递归:

1表示该角色在第一个备选

中被捕获到第 2 组

2表示该字符与第二个备选匹配

或者如果 2不高于字符，? 的零重复选项被行使

\表示该字符与第一个备用

中第 2 组的反向引用匹配

_表示递归分支的底部

即使有其他选择，也不会重新进入此分支!

现在让我们考虑 "aaa"作为输入:

      _
1 1 1 2 
a a a   # This is the first bottom of the recursion,
        # now we go back to the third 1 and try to match \.
        # This fails, so the third 1 becomes 2.
    _
1 1 2
a a a   # Now we go back to the second 1 and try to match \.
        # This fails, so the second 1 becomes 2.
  _
1 2
a a a   # The second level matched! now we go back to the first level...

_____
1 2 \
a a a   # Now the first 1 can match \, and entire pattern matches!!

现在考虑 "aaaaa" :

          _
1 1 1 1 1 2
a a a a a  # Fifth 1 can't match \, so it becomes 2. 
        _
1 1 1 1 2
a a a a a  # Fourth 1 can't match \, so it becomes 2.
    _____
1 1 1 2 /
a a a a a  # Here's a crucial point. The third 1 successfully matched.
           # Now we're back to the second 1 and try to match \, but this fails.
           # However, since PCRE recursion is atomic, the third 1 will NOT be
           # reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a  # Anchors don't match, so the first 1 becomes 2, and then also the
           # anchors don't match, so the pattern fails to match.

请注意，一旦递归级别与第一个备选方案匹配，则将来不会尝试第二个备选方案(即使这样做可能导致可能匹配)，因为 PCRE 子模式递归是原子的。

现在考虑 "aa" :

    _
1 1 2 
a a
  _
1 2
a a  # The second level matched by taking the one repetition option on ?.
     # We now go back to the first level, and we can't match \.
     # Since PCRE recursion is atomic, we can't go back to the second level
     # to try the zero repetition option on ?.
_    
2
a a  # Anchors don't match, trying zero option on ? also doesn't help,
     # so the pattern fails to match!

请注意，一旦递归级别与 ? 的一次重复匹配在第二种选择中，将来不会尝试零重复选项(即使这样做可能导致可能匹配)，因为 PCRE 子模式递归是原子的。

现在让我们考虑 aaaaaaa

              _
1 1 1 1 1 1 1 2  
a a a a a a a 
            _
1 1 1 1 1 1 2  
a a a a a a a 
        _____
1 1 1 1 1 2 \  
a a a a a a a  # A crucial point: the fifth level matched and now the fourth
               # level can't match \, but it does NOT reenter the fifth level to
               # try 2. Instead, the fourth level tries 2.
    _____    
1 1 1 2 \  
a a a a a a a 
  _________    
1 1 1 2 \ \ 
a a a a a a a 
_____________    
1 1 1 2 \ \ \
a a a a a a a  # Entire pattern is a match!

请注意，即使 PCRE 子模式递归是原子的，它仍然可以成功匹配由重复 2n-1 次的字符组成的回文。

现在，为了好玩，让我们试试 "abcba" :

          _
1 1 1 1 1 2
a b c b a
        _
1 1 1 1 2
a b c b a

1 1 1 2 
a b c b a   # Third level attempts \, but c does not match a!
            # So we go back to third 1 and try 2.
  _____
1 1 2 \ 
a b c b a 
_________
1 1 2 \ \
a b c b a   # Entire pattern is a match!

也就是说，模式不仅仅匹配“仅当一个字符重复 2n-1 次时”。确实可以匹配 "abcba" ( as seen on ideone.com )。然而，它不能匹配 "ababa" ，也不能匹配 "aaaaa" (请参阅手册页上的警告!)，因为 PCRE 中的子模式递归是原子的。

您可以应用相同的跟踪过程来解释模式在任何输入上的行为。

关于regex - 为什么这个递归正则表达式只在字符重复 2^n - 1 次时才匹配？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/3738631/

文章推荐： "concatenate on join"的 SQL 查询

文章推荐： spring - 使用 Spring Cloud Gateway 进行全局异常处理

文章推荐： azure - 如何获取Azure DevOps服务连接ID

文章推荐： sql - ms-access 将查询结果保存在一个字符串中

regex - std.regex.regex 的返回值？
我正在尝试编写一个函数，该函数接受输入字符串、正则表达式(由 std.regex.regex 从原始字符串生成)和错误消息字符串，并尝试匹配来自使用正则表达式输入字符串，如果没有匹配则显示错误消息。到
regex - `regex{n,}?` == `regex{n}` ?
-edit- 注意 ?末.{2,}? 我发现你可以写 .{2,}? 是不是和下面一模一样？ .{2} 最佳答案号{2,}表示两次或更多次同时 {2}意思是正好两次。量词默认是贪婪的，所以给定字符串
regex - 如何在 RegEx 模式的 RegEx 匹配中包含第一个字符？
我有以下文字: This is a test ::a. MODE 3 within 7 hours, ::b. MODE 4 within 13 hours, and ::c. MODE 5 with
regex - 使用 Regex.fromLiteral() 创建的 Regex 到底匹配什么？
我用 Regex.fromLiteral(".*") 创建了一个非常简单的匹配所有正则表达式. 根据documentation :“返回指定文字字符串的文字正则表达式。” 但是我真的不明白“对于指定的
regex - 有什么好的技巧可以帮助记住 RegEx？
就目前而言，这个问题不适合我们的问答形式。我们希望答案得到事实、引用或专业知识的支持，但这个问题可能会引起辩论、争论、投票或扩展讨论。如果您觉得这个问题可以改进并可能重新打开，visit the he
regex - 如何在该位置使用Nginx Regex
该Web项目将静态内容放入一些/content/img文件夹中。网址规则是:/img/{some md5} 但在文件夹中的位置:/content/img/{前两位数字}/ 例子 url:
regex - 捕获两个空格之间的数字 (RegEx)
我有以下数据: SOMEDATA .test 01/45/12 2.50 THIS IS DATA 我想从中提取数字 2.50。我已设法使用以下 RegEx 做到这一点: (?<=\d{2}\/\d{
regex - REGEX 表达式的简化
我需要证明或反驳下面的正则表达式 (RS + R )* R = R (SR + R)* // or, for programmers: /(RS|R)*R/ == /R(SR|R)*/ 我有一种强烈的
regex - 如何找到所有与货币相关的数字 REGEX？
对于具有自由文本的字符串: "The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a
regex - 将变量添加到 RegEx
例如，我有 RegEx DSX-?2 的 var 我需要将此变量添加到 RegEx 并获取此 .match(/DSX-?2/gi) 最佳答案您可以创建一个 RegExp对象使用 new RegExp
regex - REGEX Sqlite不区分大小写不起作用
我无法区分大小写的搜索无法在SQLITE中用于REGEX。支持语法吗？ SELECT * FROM table WHERE name REGEXP 'smith[s]*\i' 我希望得到以下答案（假设
regex - RegEx-捕获组中OR值的顺序更改结果
Visual Studio / XPath / RegEx：给定表达式： (?(Car|Car Blue)) +(?.+) +---> +(?.+) 给定搜索字符串： Car Blue Flying
regex - 有什么作用？ : do in regex
我有一个看起来像这样的正则表达式 /^(?:\w+\s)*(\w+)$*/ 什么是?: ? 最佳答案它表示子模式是非捕获子模式。这意味着在 (?:\w+\s) 中匹配的任何内容，即使它被 () 括起
regex - RegEx 引擎的工作原理
就目前而言，这个问题不适合我们的问答形式。我们希望答案得到事实、引用或专业知识的支持，但这个问题可能会引起辩论、争论、投票或扩展讨论。如果您觉得这个问题可以改进并可能重新打开，visit the he
regex - 使用正则表达式对值进行分组 regex
我在 Excel 工作表(也以 csv 格式)中获得了姓名列表，并根据姓名来源进行了分组。这就是我创建的组的样子。现在我想添加一个新列，名称后面包含组名称。这就是我想要获得的。我如何得到这个？
regex - 将所有特殊字符和单词分成字符串列表中的项目-Regex
我试图将一个字符串拆分为一个字符串列表，单词是分开的，但是周围的字符，例如.. "?()“”!"也分开。要分隔的字符串是"testing “testing” “one two three” (hi
regex - 为连续的完整日志文件中的每个部分匹配 REGEX
我有一个来自视频转换文件的完整日志，它看起来像这样: -------------------------------------------------------------------------
regex - REGEX 在多个定界符内只匹配一次
在定界符为“-”的模式 X-Y-Z 中，我想检查 Y 是否具有大小 8 而没有重复。 Y 可以是像 Y = (A-B-C) 这样的子集，但如果没有，则 Y 的值为 1 1 - num-12345678
java - 为什么Java有replaceAll(regex, replacement)而不是replaceAll(regex, regex)？
Java确实有这个功能，谢谢你的回答，对我来说失去对API的关注太可惜了... 例如: String strOriginal = "A:B&C@D"; 我认为java中应该有一个非常好的方法来改变它，
regex - 带增量的小时的正则表达式 (RegEx)
我只需要接受符合这些规则的输入... 0.25-24 0.25 的增量(.00、.25、.50、.75) 第一个数字不是必须的。希望尾随零是可选的。一些有效条目的示例: 0.25 .50 .5 1

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

regex - 为什么这个递归正则表达式只在字符重复 2^n - 1 次时才匹配？

WARNING!!!