c - How to patch PCRE2 so that \w matches marks?


Reposted by: 太空宇宙. Updated: 2023-11-04 02:21:25

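I am using the PCRE2 library, which can be found here .

As you can see here , PCRE2's \w matches only the L and N categories plus the underscore; it does not match M, the mark category (see here ). .NET Regex, however, does match marks (see here ).

I want to change the PCRE2 source code so that it behaves like .NET Regex, but I am not sure I am doing it correctly.

My plan is to find the places in the code that reference PT_WORD, such as this :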

case PT_WORD:
    if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
         PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
         fc == CHAR_UNDERSCORE) == (Fop == OP_NOTPROP))

and then add another line, like this:

case PT_WORD:
    if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
         PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
         PRIV(ucp_gentype)[prop->chartype] == ucp_M || // <-- new line
         fc == CHAR_UNDERSCORE) == (Fop == OP_NOTPROP))

Is this the right approach? Is there anything else to consider? What else do I need to change elsewhere in the code?

Best answer

The .NET \w construct matches:

Category    Description
Ll          Letter, Lowercase
Lu          Letter, Uppercase
Lt          Letter, Titlecase
Lo          Letter, Other
Lm          Letter, Modifier
Mn          Mark, Nonspacing
Nd          Number, Decimal Digit
Pc          Punctuation, Connector. This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.

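Note the differences: .NET \w does not match all numbers, only those in the Nd category, and within the M category it matches only the Mn subset. Adding a test against ucp_M, as in the question, would therefore also accept Mc and Me characters, which .NET \w does not match.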

Make sure you match these Unicode categories within your code and \w will behave as in .NET regex.

Use

case PT_WORD:
    /* PRIV(ucp_gentype)[] yields only the general category (ucp_L, ucp_M, ...),
       so the specific types are compared against prop->chartype directly. */
    if ((prop->chartype == ucp_Ll ||
         prop->chartype == ucp_Lu ||
         prop->chartype == ucp_Lt ||
         prop->chartype == ucp_Lo ||
         prop->chartype == ucp_Lm ||
         prop->chartype == ucp_Mn ||
         prop->chartype == ucp_Nd ||
         prop->chartype == ucp_Pc) == (Fop == OP_NOTPROP))
      RRETURN(MATCH_NOMATCH);
    break;

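Note that you do not need the fc == CHAR_UNDERSCORE test any more, because the underscore is part of \p{Pc}. The categories that must be narrowed are N and M, to Nd and Mn. The five letter subcategories listed are together the whole L category (\p{LC} is only Lu, Ll and Lt), so for letters a general-category test against ucp_L would be equivalent.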
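To make the difference between the two kinds of test concrete, here is a minimal standalone sketch. It does not use PCRE2 at all: the enums `gen_cat` and `char_type`, the `gentype` table and both function names are hypothetical stand-ins, and their values do not match pcre2_ucp.h. The table only plays the role that PRIV(ucp_gentype)[] plays in the snippets above, and only a handful of Unicode types are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PCRE2's internal enums; names and values
   are illustrative only and do not match pcre2_ucp.h. */
typedef enum { GEN_L, GEN_M, GEN_N, GEN_P, GEN_OTHER } gen_cat;

typedef enum {
  T_Ll, T_Lu, T_Lt, T_Lo, T_Lm,   /* letters */
  T_Mn, T_Mc, T_Me,               /* marks */
  T_Nd, T_Nl, T_No,               /* numbers */
  T_Pc, T_Po,                     /* two of the punctuation types */
  T_Zs,                           /* space separator */
  T_COUNT
} char_type;

/* Maps each specific type to its general category. */
static const gen_cat gentype[T_COUNT] = {
  [T_Ll] = GEN_L, [T_Lu] = GEN_L, [T_Lt] = GEN_L, [T_Lo] = GEN_L, [T_Lm] = GEN_L,
  [T_Mn] = GEN_M, [T_Mc] = GEN_M, [T_Me] = GEN_M,
  [T_Nd] = GEN_N, [T_Nl] = GEN_N, [T_No] = GEN_N,
  [T_Pc] = GEN_P, [T_Po] = GEN_P,
  [T_Zs] = GEN_OTHER
};

/* The question's proposed test: whole L, N and M categories, plus '_'. */
static bool word_by_general_category(char_type t, bool is_underscore)
{
  gen_cat g = gentype[t];
  return g == GEN_L || g == GEN_N || g == GEN_M || is_underscore;
}

/* The .NET list from the table, tested on the specific type. */
static bool word_by_dotnet_list(char_type t)
{
  switch (t) {
    case T_Ll: case T_Lu: case T_Lt: case T_Lo: case T_Lm:
    case T_Mn: case T_Nd: case T_Pc:
      return true;
    default:
      return false;
  }
}

/* Usage: types on which the two tests disagree. */
static void show_differences(void)
{
  /* Spacing marks and letter numbers pass the broad test only. */
  assert(word_by_general_category(T_Mc, false) && !word_by_dotnet_list(T_Mc));
  assert(word_by_general_category(T_Nl, false) && !word_by_dotnet_list(T_Nl));
  /* A connector other than '_' passes the .NET list only. */
  assert(!word_by_general_category(T_Pc, false) && word_by_dotnet_list(T_Pc));
}
```

Calling `show_differences()` from a test program checks the three cases where the results diverge: Mc and Nl are accepted only by the general-category test, while a connector punctuation character other than the underscore is accepted only by the .NET list.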

Regarding "c - How to patch PCRE2 so that \w matches marks?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57477893/