
java - Regex to match and limit character classes

Reposted. Author: 搜寻专家. Updated: 2023-10-31 19:38:41

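I'm not sure whether this is possible with regex, but I'd like to limit the number of underscores allowed based on the number of other characters. The goal is to restrict crazy wildcard queries to a search engine written in Java.

The starting characters would be alphanumeric. Basically, I want a match if there are more underscores than preceding characters. So

BA_ would be fine, but BA___ would match the regex and get kicked out of the query parser.

Is that possible using regex?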

Best answer

Yes, you can do it. This pattern succeeds only if there are fewer underscores than letters (you can adapt it to the characters you want):

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$

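(As Pshemo noticed, the anchors are not needed if you use the matches() method. I wrote them to illustrate the fact that this pattern must be bounded one way or another, with lookarounds for example.)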
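As a minimal sketch of how the first pattern might be called from Java: the class and method names below are made up for illustration and are not part of the answer. Since matches() already requires the whole input to match, the anchors are left out here.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class UnderscoreLimit {
    // The answer's first pattern without ^ and $, because matches() is implicitly anchored.
    private static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?");

    // True when the term is uppercase letters followed by strictly fewer underscores.
    static boolean hasFewerUnderscores(String term) {
        return FEWER_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"BA_", "BA__", "BA___", "ABC"}) {
            System.out.println(term + " -> " + hasFewerUnderscores(term));
        }
    }
}
```

Only the terms with fewer underscores than letters (BA_ and ABC in this list) should be accepted.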

Negated version:

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$
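The negated version fits the query-parser case from the question, where a match means the term gets kicked out. The sketch below assumes a hypothetical guard class (WildcardGuard and shouldReject are invented names). By my reading of the pattern, it also matches when the underscores equal the letters in number, so a term like BA__ would be rejected as well.

```java
import java.util.regex.Pattern;

// Hypothetical guard for a query parser; names are illustrative only.
final class WildcardGuard {
    // The answer's negated pattern, copied as a Java string literal.
    private static final Pattern TOO_MANY_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$");

    // True when the term should be kicked out of the query parser.
    static boolean shouldReject(String term) {
        return TOO_MANY_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"BA_", "BA___", "ABC"}) {
            System.out.println(term + " rejected: " + shouldReject(term));
        }
    }
}
```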

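The idea is to repeat a capture group that contains a backreference to itself plus an underscore. With each repetition the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ matches every letter that has a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish the pattern with \\1?, which contains all the underscores (I made it optional in case there are no underscores at all).

Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set the exact difference between the number of letters and the number of underscores.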
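A small sketch of the [A-Z]{n} variant, assuming a hypothetical factory method (lettersExceedBy is an invented name) that splices n into the first pattern:

```java
import java.util.regex.Pattern;

// Hypothetical factory; the class and method names are illustrative only.
final class ExactDifference {
    // Builds the first pattern with [A-Z]{n} in place of [A-Z]+,
    // so the letters must outnumber the underscores by exactly n.
    static Pattern lettersExceedBy(int n) {
        return Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]{" + n + "}\\1?");
    }

    public static void main(String[] args) {
        Pattern byTwo = lettersExceedBy(2);
        for (String term : new String[] {"AB", "ABC_", "AB_", "ABCD_"}) {
            System.out.println(term + " -> " + byTwo.matcher(term).matches());
        }
    }
}
```

With n = 2, AB and ABC_ should match (difference of exactly two), while AB_ and ABCD_ should not.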


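To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's not possible to put an underscore in bold, I use hyphens instead):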

String: ABC--        Pattern: ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

In the non-capturing group, the first letter (A) is found.
We enter the lookahead (keep in mind that everything inside the lookahead is only a check and not part of the match result). [A-Z]* skips the remaining letters (BC).
The first capture group is encountered for the first time and its content is not yet defined. This is why an optional quantifier is used: it keeps the lookahead from failing. Consequence: \1?+ matches nothing.
The literal hyphen matches the first hyphen. Once the capture group is closed, it is defined and contains one hyphen.
The lookahead succeeds, so the non-capturing group is repeated.
The second letter (B) is found and we enter the lookahead again.
This time things are different. The capture group was defined before and contains one hyphen, so \1?+ matches the first hyphen. The literal hyphen then matches the second hyphen in the string, and capture group 1 now contains both hyphens. The lookahead succeeds.
The non-capturing group is repeated one more time. In the lookahead there are no more letters, which is not a problem since the * quantifier is used. \1?+ now matches two hyphens.
But there is no hyphen left in the string for the literal hyphen, and the regex engine cannot backtrack because \1?+ has a possessive quantifier. The lookahead fails, and so does the third repetition of the non-capturing group.
[A-Z]+ then ensures that there is at least one more letter (here C).
The backreference to capture group 1, which contains the two hyphens, matches up to the end of the string. Note that because this backreference is optional, the string may contain no hyphens at all.
$ matches the end of the string. The pattern succeeds.
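The walkthrough can be checked from Java by looking at what capture group 1 holds after a successful match. This is only a sketch; WalkthroughCheck and capturedHyphens are invented names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check of the walkthrough; names are illustrative only.
final class WalkthroughCheck {
    // Hyphen version of the first pattern, as used in the walkthrough.
    private static final Pattern HYPHEN_VERSION =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");

    // Returns the content of capture group 1 after a successful match.
    // Returns null if the input does not match or the group never took part.
    static String capturedHyphens(String input) {
        Matcher m = HYPHEN_VERSION.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("ABC-- captures: " + capturedHyphens("ABC--"));
        System.out.println("ABC   captures: " + capturedHyphens("ABC"));
    }
}
```

For ABC-- the group should hold the two hyphens, and for ABC it should stay unset, which is why the final backreference has to be optional.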


Note: The possessive quantifier on the non-capturing group is needed to avoid false results. (Without it you can observe a strange behavior, which can be useful.)

Example: the string ABC--- and the pattern ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)

The non-capturing group is repeated three times and ABC is matched. Note that at this step the first capture group contains ---. But after the non-capturing group there is no letter left for [A-Z]+ to match, so the regex engine must backtrack.

Question: How many hyphens are in the capture group now?
Answer:   Still three!

If the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its content as of the last time the regex engine went through it). This is counter-intuitive, but logical.

Then the letter C is found by [A-Z]+, the three hyphens are matched by \1?, and the pattern succeeds.
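To try the two variants side by side in Java, a sketch like the following could be used (PossessiveDemo, strict and lenient are invented names). It only prints the result of the non-possessive pattern rather than asserting it, since that result depends on how the engine treats captures when it backtracks.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two variants; names are illustrative only.
final class PossessiveDemo {
    // With the possessive quantifier on the non-capturing group (the answer's pattern).
    private static final Pattern STRICT =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");
    // Same pattern with a plain greedy quantifier on the group.
    private static final Pattern LENIENT =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?$");

    static boolean strict(String input) {
        return STRICT.matcher(input).matches();
    }

    static boolean lenient(String input) {
        return LENIENT.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Three letters and three hyphens: the possessive version rejects this.
        System.out.println("strict  ABC---: " + strict("ABC---"));
        // The answer describes this variant as succeeding here (the false result).
        System.out.println("lenient ABC---: " + lenient("ABC---"));
    }
}
```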

Robby Pond asked me in the comments how to find strings that have more underscores than letters (or than anything that is not an underscore). The best way is obviously to count the underscores and compare that number with the string length. A pure regex solution, however, cannot be built in Java, since the pattern needs the recursion feature. For example, you can do it in PHP:

$pattern = <<<'EOD'
~
(?(DEFINE)
(?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
)

\A (?: \g<neutral> | _ )+ \z
~x
EOD;

var_dump(preg_match($pattern, '____ABC_DEF___'));
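For Java, the counting approach mentioned above could look like the following sketch. The class and method names are invented for illustration, and "more underscores than letters" is taken to mean more underscores than all other characters combined.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class UnderscoreCount {
    // True when underscores outnumber every other character in the string.
    static boolean hasMoreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        long others = s.length() - underscores;
        return underscores > others;
    }

    public static void main(String[] args) {
        // Same test string as the PHP example: 8 underscores, 6 letters.
        System.out.println(hasMoreUnderscoresThanOthers("____ABC_DEF___"));
        System.out.println(hasMoreUnderscoresThanOthers("BA_"));
    }
}
```

Unlike the regex patterns, this check does not care where the underscores sit in the string.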

Regarding java - Regex to match and limit character classes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23790887/
