
r - Extract multiple sentences around a keyword from a text cell

Reposted. Author: 行者123. Updated: 2023-12-02 02:20:49

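I am trying to search for keywords in a large text in R. Once a keyword is found, I want to extract one sentence before and one sentence after it, along with the sentence that contains the keyword itself. Ideally, I would like to be able to adapt the code to extract up to 3 sentences on either side of the keyword. Sample data is below.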

text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."

keywords <- c("water quality", "health")

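So, for the text above, I want to search for "water quality" and "health", and when there is a match, extract everything from "Then in the middle, there is..." through "Jon Doe is a key player in this realm."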

Finally, I want to repeat this over multiple rows, each with its own text.

I have looked into using stringr/regex, but it does not give me what I want: I cannot extract complete sentences. Any ideas?

Code I have tried:

str_extract_all(text, paste0("([^\\s]+\\s){5}", keywords, "(\\s[^\\s]+){5}"))

-> This gives me a few words on either side of the keyword, not whole sentences

gsub(".*?([^\\.]*('water quality'|health)[^\\.]*).*","\\1", text, ignore.case = TRUE)

-> Also close, but not quite what I want

Best answer

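Build a pattern from the keywords, put the data in a tibble, split it into sentences (splitting on periods), and then, for every row n where the pattern is found, select rows n-1, n, and n+1.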

library(dplyr)
library(tidyr)

keywords <- c("water quality", "health")
pat <- paste0(keywords, collapse = '|')
pat
#[1] "water quality|health"

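# Split the text into one sentence per row, then keep each matching
# row together with the rows immediately before and after it
tibble(text) %>%
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    tmp <- grep(pat, text, ignore.case = TRUE)  # rows containing a keyword
    sort(unique(c(tmp - 1, tmp, tmp + 1)))      # neighbours, de-duplicated
  })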

# text
# <chr>
#1 Then in the middle, there is a sentence that I want to extract
#2 Water quality is a serious concern in Akron, Ohio
#3 It can impact ecological systems and human health
#4 Jon Doe is a key player in this realm
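
The question also asks for an adjustable window (up to 3 sentences) and for running the extraction over many rows. A minimal base R sketch of one way to do that follows; it is not part of the original answer. The helper name `extract_around`, its `window` argument, and the `docs` data frame are hypothetical, and the sentence splitter naively assumes every sentence ends in `.`, `!` or `?` followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly.

```r
# Hypothetical helper: return the sentences within `window` sentences of any
# keyword match, joined back into a single string (NA if nothing matches).
extract_around <- function(txt, keywords, window = 1) {
  # Split on whitespace that follows sentence-ending punctuation,
  # which keeps the punctuation attached to each sentence
  sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  pat <- paste0(keywords, collapse = "|")
  hits <- grep(pat, sentences, ignore.case = TRUE)
  if (length(hits) == 0) return(NA_character_)
  # Every index within `window` of a hit, clipped to the valid range
  idx <- unlist(lapply(hits, function(i) (i - window):(i + window)))
  idx <- sort(unique(idx[idx >= 1 & idx <= length(sentences)]))
  paste(sentences[idx], collapse = " ")
}

keywords <- c("water quality", "health")

# Illustrative data: one row per document (the second row has no match)
docs <- data.frame(
  text = c(
    "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about.",
    "Nothing relevant here. Just filler."
  ),
  stringsAsFactors = FALSE
)

# Apply the helper to every row; change `window` to 2 or 3 for more context
docs$extract <- vapply(docs$text, extract_around, character(1),
                       keywords = keywords, window = 1, USE.NAMES = FALSE)
```

Overlapping windows are merged because the indices are de-duplicated before the sentences are pasted back together, so two nearby matches produce one continuous extract instead of repeated sentences.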

Regarding "r - Extract multiple sentences around a keyword from a text cell", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66449007/
