regex - Why does strsplit use positive lookahead and lookbehind assertion matches differently?


Reposted by: 行者123 Last updated: 2023-12-03 10:22:57

Common-sense and sanity checks using gregexpr() indicate that the lookbehind and lookahead assertions below should each match at exactly one position in testString:

testString <- "text XX text"
BB  <- "(?<= XX )"
FF  <- "(?= XX )"

as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5

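strsplit(), however, uses those match positions differently. It splits testString at one position when given the lookbehind assertion, but at two positions (the second of which seems incorrect) when given the lookahead assertion.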

strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"    

strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text"    " "       "XX text"

I have two questions: (Q1) What is going on here? And (Q2) how can strsplit() be made to behave better?

Update: Theodore Lytras' excellent answer explains what is going on, and so addresses (Q1). My answer builds on his to identify a remedy, and addresses (Q2).

Best answer

I am not sure whether this qualifies as a bug, because I believe it is the expected behaviour according to the R documentation. From ?strsplit:

The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
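The note about matches at the beginning and end of a string can be checked with an ordinary, non-zero-length separator. This is a minimal illustration; the input strings and the variable names lead and trail are made up for the example.

```r
# Separator at the start: the first element of the result is the empty string.
lead <- strsplit(",a,b", ",")[[1]]
print(lead)
# [1] ""  "a" "b"

# Separator at the end: no trailing empty string is produced.
trail <- strsplit("a,b,", ",")[[1]]
print(trail)
# [1] "a" "b"
```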

The problem is that lookahead (and lookbehind) assertions are zero-length. Consider, for example, this case:

FF <- "(?=funky)"
testString <- "take me to funky town"

gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"

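What happens is that the lone lookahead (?=funky) matches at position 12. The first split therefore consists of the string up to position 11 (everything to the left of the match), which is then removed from the string together with the match. The match, however, has zero length.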

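The remaining string is now funky town, and the lookahead matches at position 1. But there is nothing to remove, because nothing lies to the left of the match and the match itself has zero length. Taken literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour of strsplit with an empty regular expression (when the argument is split=""). After this the remaining string is unky town, which is returned as the last piece because it contains no match.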
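The walk-through above can be reproduced with a small emulation of the documented loop. This is only a sketch: split_sketch is a hypothetical helper, not part of R, and its rule of peeling off one character when an empty match sits at position 1 is an assumption inferred from the outputs shown here, not taken from the R sources.

```r
# Hypothetical emulation of the loop described in ?strsplit (perl = TRUE),
# plus an assumed one-character step for an empty match at position 1.
split_sketch <- function(x, pattern) {
  out  <- character(0)
  rest <- x
  while (nzchar(rest)) {
    m     <- regexpr(pattern, rest, perl = TRUE)
    start <- as.integer(m)
    len   <- attr(m, "match.length")
    if (start == -1L) {
      # No match: the remainder is the last piece.
      out <- c(out, rest)
      break
    }
    if (start == 1L && len == 0L) {
      # Empty match at the very start: nothing could be removed,
      # so emit a single character to make progress (assumption).
      out  <- c(out, substr(rest, 1L, 1L))
      rest <- substr(rest, 2L, nchar(rest))
    } else {
      # Emit the text left of the match, then drop it and the match.
      out  <- c(out, substr(rest, 1L, start - 1L))
      rest <- substr(rest, start + len, nchar(rest))
    }
  }
  out
}

funky  <- split_sketch("take me to funky town", "(?=funky)")
ahead  <- split_sketch("text XX text", "(?= XX )")
behind <- split_sketch("text XX text", "(?<= XX )")
print(funky)
# [1] "take me to " "f"           "unky town"
```

For these three inputs the sketch returns the same pieces as the strsplit() calls shown in the question and in this answer.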

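Lookbehinds pose no such problem. The text that a lookbehind requires lies to the left of the match, so every match is preceded by a non-empty piece that is split off and removed from the remaining string, and the algorithm never gets stuck.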

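Admittedly, this behaviour looks strange at first glance. Any other behaviour, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.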
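One possible remedy for (Q2), offered as a sketch and not necessarily the fix referred to in the question's update: prefix the lookahead with the lookbehind (?<=.), so that the pattern can only match after at least one character. It then cannot match at position 1 of the remaining string, which is the case that triggers the one-character split. The variable names fixed and fixed2 are illustrative.

```r
testString <- "text XX text"

# Lookahead guarded by a lookbehind that requires one preceding character.
fixed <- strsplit(testString, "(?<=.)(?= XX )", perl = TRUE)[[1]]
print(fixed)
# [1] "text"     " XX text"

# The same guard applied to the example from this answer.
fixed2 <- strsplit("take me to funky town", "(?<=.)(?=funky)", perl = TRUE)[[1]]
print(fixed2)
# [1] "take me to " "funky town"
```

The guard assumes a split should never occur before the first character of the input, so a lookahead that would match at the very beginning of the original string no longer produces a split there.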

Regarding "regex - Why does strsplit use positive lookahead and lookbehind assertion matches differently?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15575221/
