R regex: using \\b with 'Å' vs. 'A' characters

Reposted by: 行者123 · Updated: 2023-12-04 01:31:53

\\b denotes a word boundary. I don't understand why this operator behaves differently depending on the character that follows it. Example:

test1 <- 'aland islands'
test2 <- 'åland islands'

regex1 <- "[å|a]land islands"
regex2 <- "\\b[å|a]land islands"

grepl(regex1, test1, perl = TRUE)
[1] TRUE
grepl(regex2, test1, perl = TRUE)
[1] TRUE

grepl(regex1, test2, perl = TRUE)
[1] TRUE
grepl(regex2, test2, perl = TRUE)
[1] FALSE

This only seems to be a problem when perl = TRUE:

grepl(regex1, test2, perl = FALSE)
[1] TRUE
grepl(regex2, test2, perl = FALSE)
[1] TRUE

Unfortunately, in my application I absolutely need to keep perl = TRUE.

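Best answer

This is a (known) glitch in R's regular expression subsystem, related to the character encoding of the input and to the system locale and built-in character properties.

The R documentation for grep states (emphasis added):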

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Only gsub and gregexpr are mentioned here, but grepl appears to be affected as well.
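One way to see the mechanism on a given machine is to ask PCRE whether it classifies å as a word character (\\w), since \\b is a boundary between a \\w character and a non-\\w character. The following is a minimal diagnostic sketch and not part of the original answer; the variable names are illustrative. The results depend on the platform, locale and PCRE build, so no expected output is shown.

```r
# Diagnostic sketch: does PCRE treat 'å' as a word character on this system?
# 'å' is written as an escape so the script's file encoding does not matter.
a_ring <- "\u00e5"

# Run a PCRE match; return NA if the pattern cannot be compiled
# (for example, if (*UCP) is not supported by this PCRE build).
try_pcre <- function(pattern, x) {
  tryCatch(grepl(pattern, x, perl = TRUE), error = function(e) NA)
}

# If 'å' is not matched by \w, then \b cannot match directly in front of it
# at the start of a string.
is_word_default <- try_pcre("^\\w$", a_ring)
is_word_ucp     <- try_pcre("(*UCP)^\\w$", a_ring)

print(c(default = is_word_default, ucp = is_word_ucp))
```

If `default` is FALSE while `ucp` is TRUE, that would be consistent with the behaviour described in the question.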

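Possible solutions

Use R's default regex engine (TRE) by setting perl = FALSE, as you have already discovered.
Stick with PCRE and use the (*UCP) flag (Unicode character properties). It changes the matching behaviour so that Unicode letters and digits count as word characters, which lets \\b match in front of å:
Code example: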
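```
options(encoding = "UTF-8")

test1 <- 'aland islands'
test2 <- 'åland islands'
# Note: inside [...] the | is a literal pipe, so [åa] would be enough.
regex1 <- "[å|a]land islands"
# (*UCP) has to appear at the very start of the pattern.
regex2 <- "(*UCP)\\b[å|a]land islands"

# Tests 1-2: PCRE, ASCII input
grepl(regex1, test1, perl = TRUE)
#[1] TRUE
grepl(regex2, test1, perl = TRUE)
#[1] TRUE
# Tests 3-4: PCRE, non-ASCII input
grepl(regex1, test2, perl = TRUE)
#[1] TRUE
grepl(regex2, test2, perl = TRUE)
#[1] TRUE
# Tests 5-6: TRE, non-ASCII input ((*UCP) is PCRE-only syntax)
grepl(regex1, test2, perl = FALSE)
#[1] TRUE
grepl(regex2, test2, perl = FALSE)
#[1] FALSE
```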
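Notes:
- The sixth test, grepl(regex2, test2, perl = FALSE), fails because it uses the (*UCP) flag with TRE.
- The (*UCP) flag does not work if R's PCRE was built without Unicode support (which may be the case in some environments, for example some minimal cloud/Docker installations).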
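Whether the second caveat applies can be checked at run time. The sketch below uses base R's pcre_config(), which reports how the PCRE library linked into R was built; it assumes an R version recent enough to provide that function. The helper `with_ucp()` is a hypothetical name introduced here for illustration and is not from the original answer.

```r
# Sketch: prepend (*UCP) only when this R build's PCRE reports
# Unicode property support.
cfg <- pcre_config()
has_ucp <- isTRUE(unname(cfg["Unicode properties"]))

# Hypothetical helper: add the (*UCP) prefix when it is supported,
# otherwise return the pattern unchanged.
with_ucp <- function(pattern) {
  if (has_ucp) paste0("(*UCP)", pattern) else pattern
}

# 'å' is written as an escape so the script's file encoding does not matter.
a_ring  <- "\u00e5"
pattern <- with_ucp(paste0("\\b[", a_ring, "a]land islands"))

inputs <- c("aland islands", paste0(a_ring, "land islands"))
print(has_ucp)
print(grepl(pattern, inputs, perl = TRUE))
```

When `has_ucp` is FALSE the pattern is left unchanged, so the original \\b behaviour applies; in that case falling back to perl = FALSE is the other option described above.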