
r - Regex pattern matching error when splitting text into two columns of a data frame

Reposted. Author: 行者123. Updated: 2023-12-04 10:48:34

Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data
frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. :
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

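Notice that a ":" appears at a different position in each text. For example:
  • In 'x', it (":") comes after the first sentence.
  • In 'y', it (":") comes after the fourth sentence.
  • In 'z', it comes after the sixth sentence.
  • In addition, each text has another ":" before its last sentence.

What I want to do is create two columns such that:
  • Only the first ":" is considered, not the last one.
  • If there is a ":" within the first three sentences, the whole text is split into two columns; otherwise, all of the text stays in the second column and the first column holds 'NA'.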

    Wanted output for 'x':
     Col1: There is a horror movie running in the iNox theater.
     Col2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    Wanted output for 'y' (because no ":" is found within the first three sentences):
     Col1: NA
     Col2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    Just like the result for 'y' above, the wanted output for 'z' should be:
     Col1: NA
     Col2: all of the text from 'z'

    What I tried:
    resX <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[1]]), 
    Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[1]]))

    resY <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[2]]),
    Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[2]]))

    resZ <- data.frame(Col1 = gsub("\\s\\:.*$","\\1", df$Text[[3]]),
    Col2 = gsub("^[^:]+(?:).\\s","\\1", df$Text[[3]]))

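    Then use rbind to combine the results above into a result data frame "resDF".

    The questions are:
  • Can the above be done with a for() loop, or any other approach that makes the code simpler?
  • The results for the 'y' and 'z' texts are not what I want (as shown above).

    Best answer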

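    Brief

    My answer was inspired by Rizwan's answer, so you will see that his answer complements mine. What I didn't like about his is that it breaks on periods that do not start a new sentence (such as the one in row.names, although the sample text the OP provided has no case where row.names occurs 3 times within the first 2 sentences to demonstrate this). I also made sure that the capture groups/columns are numbered exactly as the OP expects and that the pattern always matches. My answer is really an improvement on Rizwan's.

    Note 1: I am assuming that a "sentence" is defined by a period/dot followed by at least one horizontal whitespace character.
    Note 2: This works with PCRE regex. It is untested in other regex flavors and may need adapting to work in them (i.e. the if/else, vertical whitespace and horizontal whitespace tokens).

    Code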


    ^(?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)(.*)$
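
    A minimal sketch of how the pattern above might be applied in R to fill both columns at once, without a for() loop. It assumes that perl = TRUE is used (so that the PCRE conditional, \h and \v are available), that the line breaks inside the sample strings are only wrapping and can be collapsed to single spaces, and that an empty group 1 should be reported as NA. The names pat, texts, flat, matched, col1 and col2 are illustrative and not part of the original answer, and the short texts are hypothetical stand-ins for df$Text.

```r
# The answer's pattern, with each backslash doubled for an R string literal.
pat <- "^(?(?!(?:[^:\\v]*?\\.\\h){3,})([^:\\v]*?)\\s*:\\s*|)(.*)$"

# Hypothetical short texts standing in for df$Text.
texts <- c(
  "One. : Two. Three. Four : end",          # colon after the first sentence
  "One. Two. Three.\nFour. : Five : end"    # colon after the fourth sentence
)

# Assumption: line breaks are only wrapping, so collapse whitespace runs.
flat <- gsub("\\s+", " ", texts)

# Guard against texts the pattern cannot match at all (e.g. no colon early
# enough and none later): sub() would otherwise return them unchanged.
matched <- grepl(pat, flat, perl = TRUE)

# An unset group 1 comes back as an empty string from sub().
col1 <- ifelse(matched, sub(pat, "\\1", flat, perl = TRUE), "")
col2 <- ifelse(matched, sub(pat, "\\2", flat, perl = TRUE), flat)
col1[col1 == ""] <- NA_character_

resDF <- data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
```

    Because sub() and grepl() are vectorised, the same lines should work on the whole Text column. Note that a text which legitimately starts with ":" would also end up with NA in Col1 under this sketch.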

    Results

    Input
    There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    Output

    Match 1
  • Group 1: There is a horror movie running in the iNox theater.
  • Group 2: If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    Match 2
  • Group 1: empty - not matched
  • Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

    Match 3
  • Group 1: empty - not matched
  • Group 2: There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one. : And the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please


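    Explanation
  • ^ asserts position at the start of the string
  • (?(?!(?:[^:\v]*?\.\h){3,})([^:\v]*?)\s*:\s*|)
      • (?(?!...)x|y) is an if statement that uses the negative lookahead (?!...) as its condition
          • (?:[^:\v]*?\.\h){3,} matches the following at least 3 times
              • [^:\v]*? matches any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
              • \.\h matches a literal dot character followed by a horizontal whitespace character (space or tab)
      • If branch: when the condition is met (the lookahead does not find 3 such sentence ends before the first colon), do the following
          • ([^:\v]*?)\s*:\s*
              • ([^:\v]*?) captures into group 1: any character not in the set (not a colon or a vertical whitespace character) any number of times, but as few as possible
              • \s*:\s* matches any number of whitespace characters, followed by a colon, followed by any number of whitespace characters (note that you can change * to + if there is always at least 1 whitespace character before and after the colon; this improves matching when a "sentence" may itself contain a :)
      • Else branch: when the condition is not met, match nothing
  • (.*) captures into group 2: any character (excluding newlines when the s flag is off) any number of times
  • $ asserts position at the end of the string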
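
    The logic that the pattern encodes can also be spelled out step by step, which may be easier to adapt than the conditional regex. The following is a rough sketch under the same "sentence" assumption (a dot followed by a horizontal whitespace character); the function name split_on_early_colon and its max_sentences parameter are hypothetical and do not come from the original answer, and the short texts are made-up stand-ins for df$Text.

```r
# Split at the first colon only if fewer than max_sentences sentence ends
# (a dot followed by horizontal whitespace) come before it.
split_on_early_colon <- function(txt, max_sentences = 3L) {
  no_split <- c(Col1 = NA_character_, Col2 = txt)

  # Position of the first colon, or -1 if there is none.
  colon_pos <- as.integer(regexpr(":", txt, fixed = TRUE))
  if (colon_pos == -1L) return(no_split)

  before <- substr(txt, 1L, colon_pos - 1L)
  after  <- substr(txt, colon_pos + 1L, nchar(txt))

  # Count sentence ends before the colon; gregexpr() returns -1 when none.
  hits <- gregexpr("\\.\\h", before, perl = TRUE)[[1]]
  n_ends <- if (hits[1] == -1L) 0L else length(hits)
  if (n_ends >= max_sentences) return(no_split)

  # Drop the whitespace around the colon, as \s*:\s* does in the pattern.
  c(Col1 = sub("\\s+$", "", before), Col2 = sub("^\\s+", "", after))
}

# Hypothetical short texts standing in for df$Text.
texts <- c(
  "One. : Two. Three. Four : end",
  "One. Two. Three. Four. : Five : end"
)

# One row per text, columns Col1 and Col2.
res <- t(vapply(texts, split_on_early_colon,
                FUN.VALUE = c(Col1 = "", Col2 = ""), USE.NAMES = FALSE))
resDF <- data.frame(res, stringsAsFactors = FALSE)
```

    Unlike the regex, this sketch also returns NA plus the full text when a string contains no colon at all, which seems to be the behaviour the question asks for.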
    Regarding "r - Regex pattern matching error when splitting text into two columns of a data frame", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46388016/
