c++ - Regex to match C++ string constants

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 23:26:38

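I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 characters, like "hey I'm a string".
I'm currently using this one here: \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.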

Test cases:

std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";

Expected output:
std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");

In the second one I instead get:
std::cout << String("He said: \")bananas\"String(" << ")...";

Short repro code (using AR.3's regex):
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");

Best answer

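Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.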

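We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:

  • Single-line comments
  • Multi-line comments
  • Character literals
  • String literals

We'll be using the extended (x) option, which ignores whitespace in the pattern.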

Comments

Here's what [lex.comment] says:

    The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]



    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

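Easy peasy. If you match anything there, just (*SKIP)(*FAIL), meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, so it is allowed to match newlines.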

Character literals

Here's the grammar from [lex.ccon]:

    character-literal:
        encoding-prefix(opt) ' c-char-sequence '
    encoding-prefix: one of
        u8 u U L
    c-char-sequence:
        c-char
        c-char-sequence c-char
    c-char:
        any member of the source character set except the single-quote ', backslash \, or new-line character
        escape-sequence
        universal-character-name
    escape-sequence:
        simple-escape-sequence
        octal-escape-sequence
        hexadecimal-escape-sequence
    simple-escape-sequence: one of
        \' \" \? \\ \a \b \f \n \r \t \v
    octal-escape-sequence:
        \ octal-digit
        \ octal-digit octal-digit
        \ octal-digit octal-digit octal-digit
    hexadecimal-escape-sequence:
        \x hexadecimal-digit
        hexadecimal-escape-sequence hexadecimal-digit


Let's define a few things first; we'll need them later on:

    (?(DEFINE)
    (?<prefix> (?:u8?|U|L)? )
    (?<escape> \\ (?:
    ['"?\\abfnrtv] # simple escape
    | [0-7]{1,3} # octal escape
    | x [0-9a-fA-F]{1,2} # hex escape
    | u [0-9a-fA-F]{4} # universal character name
    | U [0-9a-fA-F]{8} # universal character name
    ))
    )
  • prefix is defined as an optional u8, u, U or L
  • escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity
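
Engines without (?(DEFINE) ...) can still reuse a subpattern by keeping it in an ordinary string and splicing it in where needed. Below is a minimal sketch of the escape definition for std::regex, assuming its default ECMAScript grammar; whitespace is removed because std::regex has no free-spacing mode. The names kEscape, is_escape_sequence and check_escape_examples are made up for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The "escape" subpattern from above, rewritten without whitespace for
// std::regex (ECMAScript grammar). kEscape is an illustrative name.
const std::string kEscape =
    R"re(\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}))re";

// Returns true when `text` is exactly one escape sequence accepted by kEscape.
bool is_escape_sequence(const std::string& text) {
    static const std::regex re(kEscape);
    return std::regex_match(text, re);
}

// A few sanity checks; "\\n" is the two-character text backslash + n.
void check_escape_examples() {
    assert(is_escape_sequence("\\n"));      // simple escape
    assert(is_escape_sequence("\\23"));     // octal escape
    assert(is_escape_sequence("\\x12"));    // hex escape
    assert(is_escape_sequence("\\u00e9"));  // universal character name
    assert(!is_escape_sequence("\\q"));     // not a known escape
    assert(!is_escape_sequence("n"));       // no backslash at all
}
```

Like the pattern above, this accepts at most two hex digits after \x, which is narrower than the grammar quoted from the standard.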

Once we have these, a character literal is pretty simple:

    (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

We throw it away with (*SKIP)(*FAIL).

Simple strings

They're defined in almost the same way as character literals. Here's a part of [lex.string]:

    string-literal:
        encoding-prefix(opt) " s-char-sequence(opt) "
        encoding-prefix(opt) R raw-string
    s-char-sequence:
        s-char
        s-char-sequence s-char
    s-char:
        any member of the source character set except the double-quote ", backslash \, or new-line character
        escape-sequence
        universal-character-name


This will mirror the character literals:

    (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

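The differences are:
  • The character sequence is optional this time (* instead of +)
  • The double quote, rather than the single quote, is disallowed when unescaped
  • We actually don't throw it away :)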
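
The simple-string alternative on its own can be tried with std::regex, since it needs no PCRE-only features once the atomic group is replaced by a plain group that consumes one character per iteration. This is only a sketch under that assumption: wrap_simple_strings and check_simple_strings are hypothetical names, and comments, character literals and raw strings are deliberately not handled here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Wraps every ordinary (non-raw) string literal in String(...).
// Illustrative only: comments, character literals and raw strings
// are NOT recognised by this function.
std::string wrap_simple_strings(const std::string& source) {
    // prefix " ( escape | any char except ", backslash, CR, LF )* "
    static const std::regex re(
        R"re((?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*")re");
    // $& stands for the whole match in the default (ECMAScript) format.
    return std::regex_replace(source, re, "String($&)");
}

// The failing test case from the question, plus the empty literal.
void check_simple_strings() {
    assert(wrap_simple_strings(R"x(std::cout << "He said: \"bananas\"" << "...";)x")
           == R"x(std::cout << String("He said: \"bananas\"") << String("...");)x");
    assert(wrap_simple_strings(R"x(std::cout << "";)x")
           == R"x(std::cout << String("");)x");
}
```

Note that "" is wrapped as well, which differs from the expected output in the question.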

Raw strings

Here's the raw string part:

    raw-string:
        " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
    r-char-sequence:
        r-char
        r-char-sequence r-char
    r-char:
        any member of the source character set, except a right parenthesis )
        followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
    d-char-sequence:
        d-char
        d-char-sequence d-char
    d-char:
        any member of the basic source character set except:
        space, the left parenthesis (, the right parenthesis ), the backslash \,
        and the control characters representing horizontal tab,
        vertical tab, form feed, and newline.


The regex for this is:

    (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
  • [^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
  • \k<delimiter> refers to the previously matched delimiter
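
The backreference is what makes the raw-string alternative work, and numbered backreferences are also available in std::regex's ECMAScript grammar. A minimal sketch follows, with the named group replaced by group 1 and a second group added to capture the contents; RawString, find_raw_string and check_raw_string are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Result of locating the first raw string literal in a piece of source text.
struct RawString {
    std::string delimiter;  // the d-char-sequence (may be empty)
    std::string body;       // the r-char-sequence between the parentheses
    bool found = false;
};

RawString find_raw_string(const std::string& source) {
    // Group 1 is the delimiter; \1 demands the same text before the closing
    // quote. Group 2 lazily captures the contents, newlines included.
    static const std::regex re(
        R"re((?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\(([\s\S]*?)\)\1")re");
    RawString result;
    std::smatch m;
    if (std::regex_search(source, m, re)) {
        result.delimiter = m[1].str();
        result.body = m[2].str();
        result.found = true;
    }
    return result;
}

// A )" inside the body does not end the literal; only )hello" does.
void check_raw_string() {
    const RawString r =
        find_raw_string(R"src(auto s = u8R"hello(a "quoted" (b)" c)hello";)src");
    assert(r.found);
    assert(r.delimiter == "hello");
    assert(r.body == R"x(a "quoted" (b)" c)x");
}
```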

The full pattern

The full pattern is:

    (?(DEFINE)
    (?<prefix> (?:u8?|U|L)? )
    (?<escape> \\ (?:
    ['"?\\abfnrtv] # simple escape
    | [0-7]{1,3} # octal escape
    | x [0-9a-fA-F]{1,2} # hex escape
    | u [0-9a-fA-F]{4} # universal character name
    | U [0-9a-fA-F]{8} # universal character name
    ))
    )

    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

    # character literal
    | (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

    # standard string
    | (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

    # raw string
    | (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

boost::regex

Here's a simple demo program using boost::regex:

    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    static void test()
    {
    boost::regex re(R"regex(
    (?(DEFINE)
    (?<prefix> (?:u8?|U|L) )
    (?<escape> \\ (?:
    ['"?\\abfnrtv] # simple escape
    | [0-7]{1,3} # octal escape
    | x [0-9a-fA-F]{1,2} # hex escape
    | u [0-9a-fA-F]{4} # universal character name
    | U [0-9a-fA-F]{8} # universal character name
    ))
    )

    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

    # character literal
    | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

    # standard string
    | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "

    # raw string
    | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
    )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);

    std::string subject(R"subject(
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
    "also inside"
    //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    )subject");

    std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
    }

    int main(int argc, char **argv)
    {
        try
        {
            test();
        }
        catch (const std::exception& ex)  // catch by reference to avoid slicing
        {
            std::cerr << ex.what() << std::endl;
        }

        return 0;
    }

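(I disabled syntax highlighting because it goes nuts on this code.)

For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)? ) to make the pattern work. I believe this is a bug in boost::regex, since both PCRE and Perl work just fine with the original pattern.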

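What if we don't have a fancy regex engine at hand?

Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion can be avoided by inlining the relevant reusable parts into the main pattern.

A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?> ... ) with normal groups (?: ... ) as long as we don't nest quantifiers, which avoids catastrophic backtracking.

We can also avoid (*SKIP)(*FAIL) if we add one line of logic to the replacement function: all the alternatives to skip are grouped in one capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.

All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. Once converted, the regex becomes this monstrosity: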

    (\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"

Here's an interactive demo you can play with:

    function run() {
        var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;

        var input = document.getElementById("input").value;
        var output = input.replace(re, function(m, ignore) {
            return ignore ? m : "String(" + m + ")";
        });
        document.getElementById("output").innerText = output;
    }

    document.getElementById("input").addEventListener("input", run);
    run();

    <h2>Input:</h2>
    <textarea id="input" style="width: 100%; height: 50px;">
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
    "also inside"
    //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    </textarea>
    <h2>Output:</h2>
    <pre id="output"></pre>
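
The same capture-group trick carries over to std::regex, whose default grammar is ECMAScript-based, so the question can in principle be solved without PCRE or Boost. The following is a sketch under that assumption rather than a drop-in solution: build_pattern, wrap_string_literals and check_wrap_string_literals are hypothetical names, and the pattern is assembled from parts instead of being pasted as one line.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Assembles the same alternation as the JavaScript pattern from named parts.
static std::string build_pattern() {
    const std::string prefix  = R"re((?:u8?|U|L)?)re";
    const std::string escape  = R"re(\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}))re";
    const std::string comment = R"re(//.*|/\*[\s\S]*?\*/)re";
    const std::string chr     = prefix + "'(?:" + escape + R"re(|[^'\\\r\n])+')re";
    const std::string simple  = prefix + "\"(?:" + escape + R"re(|[^"\\\r\n])*")re";
    const std::string raw     = prefix + R"re(R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2")re";
    // Group 1 = everything to skip; group 2 = the raw-string delimiter.
    return "(" + comment + "|" + chr + ")|" + simple + "|" + raw;
}

// Wraps string literals in String(...), leaving comments and character
// literals untouched (they match group 1 and are copied through as-is).
std::string wrap_string_literals(const std::string& source) {
    static const std::regex re(build_pattern());
    std::string out;
    auto last = source.cbegin();
    for (std::sregex_iterator it(source.cbegin(), source.cend(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // text between two matches
        if (m[1].matched) {
            out += m.str();            // comment or character literal
        } else {
            out += "String(" + m.str() + ")";
        }
        last = m[0].second;
    }
    out.append(last, source.cend());   // text after the final match
    return out;
}

void check_wrap_string_literals() {
    assert(wrap_string_literals(R"x('"' // "not a string")x")
           == R"x('"' // "not a string")x");
    assert(wrap_string_literals(R"x(/* "a" */ "b")x")
           == R"x(/* "a" */ String("b"))x");
}
```

std::regex tends to be slower than PCRE, and some implementations may run into recursion limits on very long matches, so this is better suited to small inputs.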

About "c++ - Regex to match C++ string constants": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41909225/
