
regex - R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

Reposted. Author: 行者123. Updated: 2023-12-04 19:46:51

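I'm trying to remove non-alphabetic characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this belong to another group of characters?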

library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

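# Attempt: replace punctuation with a space (the '+' is left untouched)
string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
# Workaround: replace the '+' explicitly
string1 <- stri_replace_all_regex(string1, '\\+', ' ')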

Best answer

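POSIX character classes need to be wrapped inside a character class; the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.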
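As a minimal sketch of the difference between the two forms, the snippet below runs both patterns through base R's gsub on one of the strings from the question. The variable names are made up for illustration. With the bare form, base R's default engine typically treats the pattern as an ordinary bracket expression over the literal characters ':', 'p', 'u', 'n', 'c' and 't', so letters get replaced instead of punctuation; that result is left unannotated here because it is not the point of the example.

```r
x <- "this is the final + test!"

# Bare form: usually read as a plain bracket expression, i.e. "any one of
# the characters : p u n c t", not as the POSIX punct class.
bare <- gsub("[:punct:]", " ", x)

# Wrapped form: the POSIX class used inside a bracket expression.
wrapped <- gsub("[[:punct:]]", " ", x)

print(bare)
print(wrapped)
# [1] "this is the final   test "
```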

In the ASCII range, the POSIX named class punct matches all non-control, non-alphanumeric, non-space characters.

# Split the 128 ASCII code points into single-character strings
ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

However, if a locale is in effect, it can change the behavior of [[:punct:]] ...

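The R documentation ?regex states the following: Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.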

The Open Group's LC_CTYPE definition for punct says:

Define characters to be classified as punctuation characters.

In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.


However, the stringi package seems to depend on ICU, and locale is a fundamental concept in ICU.
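A small sketch for inspecting which locale settings are active on each side, assuming stringi is installed. The variable names are illustrative, and the printed values depend on the platform and session. This only reports the settings; it does not by itself show how either engine classifies '+'.

```r
library(stringi)

# Locale category that base R consults for character classification
# (value is platform- and session-dependent).
base_ctype <- Sys.getlocale("LC_CTYPE")

# Default ICU locale currently used by stringi.
icu_locale <- stri_locale_get()

print(base_ctype)
print(icu_locale)
```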

With the stringi package, I recommend using the Unicode properties \p{P} and \p{S}.

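  • \p{P} matches any kind of punctuation character. That said, it is missing nine characters that the POSIX class punct includes. This is because Unicode splits what POSIX considers punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes into play: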

    stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
    # [1] "this is a test"            "this  is also a test"
    # [3] "this is the final  test"   "this is the final   test "
  • Or fall back to gsub from base R, which handles this just fine.

    gsub('[[:punct:]]', ' ', string1)
    # [1] "this is a test"            "this  is also a test"
    # [3] "this is the final  test"   "this is the final   test "
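The split between Punctuation and Symbols can be checked directly. The sketch below, assuming stringi is installed, takes the printable ASCII characters that base R's [[:punct:]] matches and sorts them by whether ICU puts them under \p{P} or \p{S}. The variable names are illustrative only.

```r
library(stringi)

# Printable ASCII characters, one per element
ascii <- rawToChar(as.raw(33:126), multiple = TRUE)

# Characters that base R's POSIX class punct matches
posix_punct <- ascii[grepl("[[:punct:]]", ascii)]

# Classify each of them with ICU's Unicode general categories
is_p <- stri_detect_regex(posix_punct, "\\p{P}")
is_s <- stri_detect_regex(posix_punct, "\\p{S}")

paste(posix_punct[is_p], collapse = "")  # Unicode Punctuation
paste(posix_punct[is_s], collapse = "")  # Unicode Symbols, includes "+"
# [1] "$+<=>^`|~"
```

The second group holds the nine characters that \p{P} alone misses, which is why the combined class [\p{P}\p{S}] is needed to cover everything POSIX punct does in the ASCII range.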

Regarding "regex - R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45070628/
