
Regular expression problems with Cyrillic letters

Reposted. Author: 行者123. Updated: 2023-12-04 11:57:25

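I have had problems with regular expressions and Cyrillic letters in the past, so I am wondering whether I am doing something wrong.

Here are two reproducible examples:

Example 1 - problems with lookahead and lookbehind assertions: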

latin <- "city New York, Manhattan\n1st Avenue"
cyrilic <- "град Ню Йорк, Манхатън\n1во Авеню"

stringr::str_extract(latin, pattern = "(?<=city New York, )[\\w\\s]+(?=\n)")
#returns: Manhattan

stringr::str_extract(cyrilic, pattern = "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")
stringr::str_extract(cyrilic, pattern = "(?<=град Ню Йорк, ).+(?=\n)")
#both return: NA

Example 2 - problem with ignore.case = TRUE in grep:

randomWord <- "Човек"

grep(pattern = "човек", x = randomWord, ignore.case = TRUE)
#returns: integer(0)

Any ideas on how to write regular expressions so that they work with Cyrillic letters?

My default text encoding is UTF-8, and this is my session info:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Bulgarian_Bulgaria.1251 LC_CTYPE=Bulgarian_Bulgaria.1251
[3] LC_MONETARY=Bulgarian_Bulgaria.1251 LC_NUMERIC=C
[5] LC_TIME=Bulgarian_Bulgaria.1251

Best answer

I am not sure why str_extract returns NA in this case, since the regular expression appears to be valid.

However, str_locate and str_detect seem to work as expected:

stringr::str_detect(cyrilic, "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")
#returns TRUE
stringr::str_locate(cyrilic, "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")
#returns the start and end positions for Манхатън
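
Since str_detect appears to work here, it may also be worth trying for the case-insensitive match from Example 2 in the question. The sketch below uses stringr's regex() modifier with ignore_case = TRUE in place of grep(..., ignore.case = TRUE). This is an assumption rather than a confirmed fix: it has not been checked under the Bulgarian_Bulgaria.1251 locale from the question, and it assumes the script file is read as UTF-8.

```r
library(stringr)

randomWord <- "Човек"

# regex() wraps the pattern so that matching ignores case.
# Assumption: the string literals reach stringr as valid text
# (for example, the script is saved and read as UTF-8).
is_match <- str_detect(randomWord, regex("човек", ignore_case = TRUE))
is_match
```

If this returns TRUE while grep still returns integer(0), that would suggest the difference lies in the regex engine or in how the strings are encoded, not in the pattern itself.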

A workaround for your problem is to combine substr() with str_locate:

substr(cyrilic, 
stringr::str_locate(cyrilic, "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")[1],
stringr::str_locate(cyrilic, "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")[2]
)
#returns 'Манхатън'
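
The workaround above evaluates the same str_locate call twice. A minimal sketch of the same idea wrapped in a helper that locates the match once is shown here. The name extract_by_locate is hypothetical and not part of stringr, and the helper only handles a single string and its first match.

```r
library(stringr)

# Hypothetical helper (not a stringr function): extract the first match
# of `pattern` in a single string by position instead of via str_extract.
extract_by_locate <- function(string, pattern) {
  loc <- str_locate(string, pattern)  # 1 x 2 matrix with columns "start" and "end"
  if (is.na(loc[1, "start"])) {
    return(NA_character_)             # no match found
  }
  substr(string, loc[1, "start"], loc[1, "end"])
}

cyrilic <- "град Ню Йорк, Манхатън\n1во Авеню"
district <- extract_by_locate(cyrilic, "(?<=град Ню Йорк, )[\\w\\s]+(?=\n)")
district
```

Returning NA_character_ when nothing matches mirrors what str_extract does, so the helper can stand in for it in code that already checks for NA.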

For regular expression problems with Cyrillic letters, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44287211/
