
regex - R: Splitting text using multiple regex patterns and exceptions

Reposted. Author: 行者123. Last updated: 2023-12-04 14:46:12

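I would like to split a vector of character elements, text, into sentences. There is more than one pattern that marks a split ( "and/ERT", "/$" ). There are also exceptions to these patterns ( :/$. , and/ERT then , ./$. Smiley ).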

Attempt: match the cases where a split should happen, insert an unusual marker ( "^&*" ) at that place, then strsplit on that marker.

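Problem: I don't know how to handle the exceptions properly. There are explicit cases where the marker ( "^&*" ) should be removed and the original text restored before running strsplit.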

Code:

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them all explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# Without exceptions, this works. Unfortunately the marker lands before the tag, so a tag such as "./$." is split into "." and "/$.". It would be convenient to avoid this. See the "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
[1] "This are faulty propositions one and/ERT"
[2] "two ,/$,"
[3] "which I want to split ./$."
[4] "There are cases where I explicitly want and/ERT"
[5] "some where I don't want to split ./$."
[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
[8] "Thank you ./$!"

[[2]]
[1] "This are the same faulty propositions one and/ERT"
[2] "two ,/$,"
#...

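# This attempt doesn't work! The replacement string in the second gsub is only a placeholder for the step I can't work out.
text <- gsub(patternSplit, "^&*\\1", text)
text <- gsub(exceptionsSplit, '[original text without "^&*"]', text) # placeholder: how do I restore the original text here?
textsplitted <- strsplit(text, "^&*", fixed = TRUE)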
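One way the missing restore step could be sketched, assuming the exceptions can be written as literal strings rather than regular expressions: apply the same marker insertion to each exception to learn what its marked form looks like, then swap that marked form back to the original. The names x, exceptions, marker_sub, marked and pieces are hypothetical, and the input is a shortened stand-in for the real text.

```r
# Shortened, hypothetical input that contains all three exceptions.
x <- "one and/ERT two ./$. For example :/$. an and/ERT then three ./$. Smiley !/$. Bye ./$!"

patternSplit <- "(and/ERT|/\\$)"
marker_sub   <- "^&*\\1"  # marker followed by the matched tag

# Assumption: exceptions are literal strings, not regular expressions.
exceptions <- c(":/$.", "and/ERT then", "./$. Smiley")

# Step 1: insert the marker in front of every split pattern.
marked <- gsub(patternSplit, marker_sub, x)

# Step 2: for each exception, compute its marked form and restore the original.
for (ex in exceptions) {
  marked_ex <- gsub(patternSplit, marker_sub, ex)
  marked    <- gsub(marked_ex, ex, marked, fixed = TRUE)
}

# Step 3: split on the markers that are left.
pieces <- strsplit(marked, "^&*", fixed = TRUE)[[1]]
print(pieces)
```

The three exceptions stay inside a single piece here. Because the marker sits in front of each tag, the pieces still break before "and/ERT" and "/$" rather than after them, so this does not yet produce the ideal split shown above.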

Best answer

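I think you can use this expression to get the splits you want. Because strsplit consumes the characters it splits on, you have to split on the space that follows the things you want to match or not match (which is what the desired output in your question shows):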

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"
#[2] "two ,/$,"
#[3] "which I want to split ./$."
#[4] "There are cases where I explicitly want and/ERT"
#[5] "some where I don't want to split ./$."
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."
#[8] "Thank you ./$!"
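Since strsplit is vectorised over its first argument, the same pattern can be applied to a whole character vector in one call. A minimal sketch, using a short hypothetical vector docs in place of the full text from the question:

```r
# Pattern copied from the answer above.
pattern <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# Hypothetical, shortened inputs; the first one contains all three exceptions.
docs <- c(
  "one and/ERT two ,/$, three ./$. For example :/$. an and/ERT then four ./$. Smiley !/$. Bye ./$!",
  "alpha and/ERT beta ./$!"
)

# One list element per input string, each holding that string's pieces.
pieces <- strsplit(docs, pattern, perl = TRUE)
print(pieces)
```

With these inputs the first string should come back as five pieces and the second as two, with ":/$.", "and/ERT then" and "./$. Smiley" left unsplit.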

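Explanation
  • (?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
  • (?!then) - BUT only if that space is NOT followed, (?!...), by "then"
  • | - OR operator that chains the next expression
  • (?<=/\\$[[:punct:]]) - positive lookbehind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a space that is NOT preceded by ":/$[[:punct:]]" (but, per the previous point, IS preceded by "/$[[:punct:]]") and is NOT followed, (?!...), by "Smiley"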
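The two halves of the pattern can be checked on their own with grepl, which makes it easier to see which assertion blocks a split. This is a small sketch on hypothetical toy strings; p1 and p2 are just names for the two alternatives in the expression above.

```r
# First alternative: a space after "and/ERT" that is not followed by "then".
p1 <- "(?<=and/ERT)\\s(?!then)"
r1 <- c(
  grepl(p1, "x and/ERT two",  perl = TRUE),  # split point exists
  grepl(p1, "x and/ERT then", perl = TRUE)   # blocked by (?!then)
)

# Second alternative: a space after "/$" plus punctuation, unless the tag
# starts with ":" or the next word is "Smiley".
p2 <- "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"
r2 <- c(
  grepl(p2, "split ./$. There",  perl = TRUE),  # split point exists
  grepl(p2, "example :/$. when", perl = TRUE),  # blocked by the negative lookbehind
  grepl(p2, "split ./$. Smiley", perl = TRUE)   # blocked by (?!Smiley)
)

print(r1)
print(r2)
```

Each lookbehind here has a fixed length, which PCRE requires; an exception whose length varies would need to be written as several fixed-length alternatives.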

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/18697005/
