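I am using the PCRE2 library, which can be found here.
As you can see here, the PCRE2 \w matches only the L and N categories and the underscore; it does not match the M (mark) category (see here). The .NET Regex \w, however, does match marks (see here).
I want to change the PCRE2 source code so that it behaves like .NET Regex, but I am not sure I am doing it correctly.
My plan is to find the places in the code that reference PT_WORD, such as this one: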
case PT_WORD:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
fc == CHAR_UNDERSCORE) == (Fop == OP_NOTPROP))
and then add another line, like this:
case PT_WORD:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
PRIV(ucp_gentype)[prop->chartype] == ucp_M || // <-- new line
fc == CHAR_UNDERSCORE) == (Fop == OP_NOTPROP))
Is this the right approach? Is there anything else to consider? Do I need to change anything elsewhere in the code?
A .NET \w construct matches the following Unicode categories:

Category  Description
Ll        Letter, Lowercase
Lu        Letter, Uppercase
Lt        Letter, Titlecase
Lo        Letter, Other
Lm        Letter, Modifier
Mn        Mark, Nonspacing
Nd        Number, Decimal Digit
Pc        Punctuation, Connector. This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.
Note the differences: .NET \w does not match all numbers, only those from the Nd category, and from the M category it matches only the Mn subset.
Make sure you match exactly these Unicode categories in your code, and \w will behave as it does in .NET regex.
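To make the difference between the two \w definitions concrete, here is a minimal, self-contained sketch. It does not use PCRE2 itself: the enums, the demo_gentype table and the function names are hypothetical stand-ins that only mimic the idea of a specific character type mapped to a general category. The PCRE2-style test follows the description in the question (any L or N), and the underscore check is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical specific character types (a small subset of Unicode). */
typedef enum {
  T_Ll, T_Lu, T_Lt, T_Lo, T_Lm,   /* letters */
  T_Mn, T_Mc, T_Me,               /* marks */
  T_Nd, T_Nl, T_No,               /* numbers */
  T_Pc, T_Po,                     /* punctuation (subset) */
  T_COUNT
} demo_char_type;

/* Hypothetical general categories. */
typedef enum { G_L, G_M, G_N, G_P } demo_gen_type;

/* Maps each specific type to its general category. */
static const demo_gen_type demo_gentype[T_COUNT] = {
  G_L, G_L, G_L, G_L, G_L,
  G_M, G_M, G_M,
  G_N, G_N, G_N,
  G_P, G_P
};

/* Word test as described in the question: any letter or any number. */
bool pcre2_style_word(demo_char_type t) {
  return demo_gentype[t] == G_L || demo_gentype[t] == G_N;
}

/* Word test following the .NET category table: specific types only. */
bool dotnet_style_word(demo_char_type t) {
  switch (t) {
    case T_Ll: case T_Lu: case T_Lt: case T_Lo: case T_Lm:
    case T_Mn: case T_Nd: case T_Pc:
      return true;
    default:
      return false;
  }
}

/* A few cases where the two definitions agree or disagree. */
void check_word_models(void) {
  assert(dotnet_style_word(T_Mn) && !pcre2_style_word(T_Mn)); /* nonspacing mark */
  assert(!dotnet_style_word(T_Nl) && pcre2_style_word(T_Nl)); /* letter number */
  assert(!dotnet_style_word(T_Mc));                           /* spacing mark */
  assert(dotnet_style_word(T_Lo) && pcre2_style_word(T_Lo));  /* both accept letters */
}
```

Calling check_word_models() exercises the cases that matter here: a nonspacing mark is a word character only in the .NET-style test, while a letter number such as a Roman numeral sign is a word character only in the PCRE2-style test.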
Use
case PT_WORD:
/* prop->chartype holds the specific type (ucp_Ll, ucp_Mn, ...).
   PRIV(ucp_gentype)[] only yields the general category (ucp_L, ucp_M, ...),
   so the specific types are compared against chartype directly. */
if ((prop->chartype == ucp_Ll ||
     prop->chartype == ucp_Lu ||
     prop->chartype == ucp_Lt ||
     prop->chartype == ucp_Lo ||
     prop->chartype == ucp_Lm ||
     prop->chartype == ucp_Mn ||
     prop->chartype == ucp_Nd ||
     prop->chartype == ucp_Pc) == (Fop == OP_NOTPROP))
  RRETURN(MATCH_NOMATCH);
break;
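Note that you do not need the fc == CHAR_UNDERSCORE check, because the underscore is part of \p{Pc}. You also cannot use the general categories ucp_N and ucp_M on their own, because they include Nl, No, Mc and Me in addition to Nd and Mn. The five letter types listed above, by contrast, together make up all of \p{L}.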
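The (...) == (Fop == OP_NOTPROP) comparison in the snippets above lets a single category test serve both the positive and the negated form of the property check. A minimal sketch of that idiom, using hypothetical stand-in names (demo_op, demo_no_match) rather than PCRE2's real definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two opcodes; not PCRE2's real values. */
typedef enum { DEMO_OP_PROP, DEMO_OP_NOTPROP } demo_op;

/* Returns true when the match must fail:
   - positive form and the character is not a word character, or
   - negated form and the character is a word character. */
bool demo_no_match(bool is_word_char, demo_op op) {
  return is_word_char == (op == DEMO_OP_NOTPROP);
}

/* Walks through all four combinations. */
void demo_check_negation(void) {
  assert(!demo_no_match(true, DEMO_OP_PROP));     /* positive form, word char: matches */
  assert(demo_no_match(false, DEMO_OP_PROP));     /* positive form, other char: fails */
  assert(demo_no_match(true, DEMO_OP_NOTPROP));   /* negated form, word char: fails */
  assert(!demo_no_match(false, DEMO_OP_NOTPROP)); /* negated form, other char: matches */
}
```

Because the negation is folded into the comparison, changing which categories count as word characters only requires editing the left-hand side of the test.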