
Remove parts of a string based on overlapping patterns

Reposted. Author: 行者123. Updated: 2023-12-03 18:20:53

I have the following data:

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))

In addition, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")

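What I want to achieve is to remove from dat$x any text that matches one of my_patterns.

I tried the solution below, but the problem is that once the first pattern (here: "my example") has been removed from the text, my solution can no longer detect the second pattern (here: "is my") or the third one.

Wrong solution: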
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
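
To make the failure concrete, here is a minimal sketch of what the combined pattern does. It assumes base R's gsub as a stand-in for str_replace_all (only to avoid the tidyverse dependency), and x simply repeats the values of dat$x. Each character can be consumed by one alternative only, so once "is my" is removed, the "my" that "my example" or "my other text" would need is already gone.

```r
# x repeats the values of dat$x from the question
x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
my_patterns <- c("my example", "is my", "my other text")

# Same idea as str_c(my_patterns, collapse = "|")
combined <- paste(my_patterns, collapse = "|")

# gsub is used here as an assumed stand-in for str_replace_all
wrong <- gsub(combined, "", x)
# wrong[1] is "this  example text": "example" survives
# wrong[2] is "and here  other text example": "other text" survives
# wrong[3] is " is short": only one pattern applies, so this row works
```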

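I figured I could do something like looping over all patterns, collecting the positions in dat$x where each pattern matches, then combining them into ranges and deleting those ranges from the text. E.g. I would add columns to my dat data frame such as start_pattern_1, end_pattern_1, and so on. For row 1 that gives 9 (start) and 18 (end) for the first pattern and 6/10 for the second pattern. I would then need to check whether any end position overlaps with any start position (here start 9 and end 10), combine them into the range 6-18, and delete that range from the text.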
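
The range-merging idea described above can be sketched without any pairwise comparison: sort the ranges by start, then sweep through them once. The helper name merge_ranges is hypothetical, and the input is assumed to be a two-column start/end matrix; this is a sketch of the idea, not code from the question.

```r
# Hypothetical helper: merge overlapping or adjacent [start, end] ranges.
# `ranges` is assumed to be a matrix with a start and an end column.
merge_ranges <- function(ranges) {
  if (nrow(ranges) == 0) return(ranges)
  # Sort by start (then end) so one pass is enough
  ranges <- ranges[order(ranges[, 1], ranges[, 2]), , drop = FALSE]
  merged <- ranges[1, , drop = FALSE]
  for (i in seq_len(nrow(ranges))[-1]) {
    last <- nrow(merged)
    if (ranges[i, 1] <= merged[last, 2] + 1) {
      # Overlaps or touches the previous range: extend it
      merged[last, 2] <- max(merged[last, 2], ranges[i, 2])
    } else {
      merged <- rbind(merged, ranges[i, , drop = FALSE])
    }
  }
  merged
}

# Row 1 of the question: "my example" at 9-18, "is my" at 6-10
ranges <- rbind(c(9, 18), c(6, 10))
colnames(ranges) <- c("start", "end")
merged <- merge_ranges(ranges)   # a single range, 6-18

# Delete the merged ranges from right to left so positions stay valid
text <- "this is my example text"
for (i in rev(seq_len(nrow(merged)))) {
  text <- paste0(substr(text, 1, merged[i, 1] - 1),
                 substr(text, merged[i, 2] + 1, nchar(text)))
}
# text is now "this  text" (the spaces on both sides of the range remain)
```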

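The problem is that I might end up with a lot of new start/end columns (possibly a few hundred patterns in my case), and if I have to compare overlapping ranges pairwise, my computer might crash.

So I wonder how to get this working, or how best to approach it. Maybe (I hope) there is a better/more elegant/simpler solution.

The desired output for dat would be: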
x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Thanks for your help!

Best answer

In the comments under the question, Uwe mentioned a newer option with str_locate_all, which simplifies the code considerably:

library(stringr)

# Remove the matched parts of a text.
# text: a single string; positions: a matrix with one row per match and
# start/end columns, as produced by str_locate_all()
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  # Blank out every matched range; overlapping ranges just hit the same characters twice
  for (i in seq_len(nrow(positions))) {
    ret[positions[i, 1]:positions[i, 2]] <- NA
  }
  paste(ret[!is.na(ret)], collapse = "")
}

# For each text, locate every pattern and stack the results
# into a single start/end matrix
matches <- lapply(as.character(dat$x), function(x) {
  do.call(rbind, str_locate_all(x, my_patterns))
})

# Preallocate the result instead of growing a vector inside the loop;
# it has the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
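
For comparison, the same "mark the covered characters, then drop them" idea can be sketched in base R with a logical mask. The helper name remove_covered is hypothetical, and unlike the str_locate_all version it assumes the patterns are literal strings (fixed = TRUE) rather than regular expressions. Because marking a character twice is harmless, no explicit merging of ranges is needed.

```r
# Hypothetical helper, base R only: mark every character covered by any
# literal pattern, then keep the characters that were never marked.
remove_covered <- function(text, patterns) {
  chars <- strsplit(text, "")[[1]]
  covered <- logical(length(chars))
  for (p in patterns) {
    hits <- gregexpr(p, text, fixed = TRUE)[[1]]
    if (hits[1] == -1) next            # this pattern does not occur
    lens <- attr(hits, "match.length")
    for (k in seq_along(hits)) {
      covered[hits[k]:(hits[k] + lens[k] - 1)] <- TRUE
    }
  }
  paste(chars[!covered], collapse = "")
}

x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
my_patterns <- c("my example", "is my", "my other text")

result <- vapply(x, remove_covered, character(1),
                 patterns = my_patterns, USE.NAMES = FALSE)
# result: "this  text", "and here  example", " is short"

# Optional cleanup, assuming single spaces are wanted in the output
tidy <- trimws(gsub(" +", " ", result))
# tidy: "this text", "and here example", "is short"
```

The work grows with the number of patterns times the number of texts, with no pairwise comparison of ranges, which may matter with a few hundred patterns.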

If you control the pattern definitions and can build the pattern by hand, a regex solution works:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"

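The idea is to build the pattern with optional parts (the ? after a grouping parenthesis).

So roughly we have:
  • (is )? <= an optional "is" followed by a space
  • my  <= the literal "my" followed by a space
  • (other text|example)? <= optional text after "my ": either "other text" or (|) "example"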
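
To check what the hand-built pattern actually consumes, one option is to extract the matches instead of deleting them. This is a small sketch using base R's regmatches and gregexpr; x simply repeats the three strings from the question.

```r
x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
pattern <- "(is )?my (other text|example)?"

# One character vector of matches per input string
matched <- regmatches(x, gregexpr(pattern, x))
# unlist(matched): "is my example", "is my other text", "my other text"
```

Each match covers the whole overlapping span in one go, which is why a single gsub call is enough here.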


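  • If you don't control the patterns, things get messier. I hope I have commented the code enough to make it understandable; given the number of loops involved, don't expect it to be fast: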
    # Given data
    dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short", "yet another text"),
                      some_other_cols = c(1, 2, 2, 4))

    my_patterns <- c("my example", "is my", "my other text")

    # Remove the matched parts of a text.
    # text: a single string; positions: a list with one element per pattern,
    # each either NULL (no match) or a c(start, end) vector
    remove_matching_parts <- function(text, positions) {
      ret <- strsplit(text, "")[[1]]
      for (pos in positions) {
        if (!is.null(pos)) ret[pos[1]:pos[2]] <- NA
      }
      paste(ret[!is.na(ret)], collapse = "")
    }

    # Match one pattern against a character vector.
    # Returns a list with one element per text: NULL or c(start, end).
    # Note: regexec() reports only the first match of the pattern in each text.
    match_pat_to_vector <- function(pattern, vector) {
      lapply(regexec(pattern, vector), function(x) {
        if (x[1] > -1) {
          # end is start + match.length, i.e. one character past the match,
          # so the character right after each match (here a space) is removed too
          c(start = as.numeric(x[1]), end = as.numeric(x[1] + attr(x, "match.length")[1]))
        }
      })
    }

    # Loop over the patterns: one list per pattern, one element per text
    matches <- lapply(my_patterns, match_pat_to_vector, vector = as.character(dat$x))

    # Preallocate the result instead of growing a vector inside the loop;
    # it has the same length as the vector we work against
    dat$result <- vector("character", length(dat$x))
    # Loop over each value to remove the matching parts
    for (i in seq_along(dat$x)) {
      dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), lapply(matches, `[[`, i))
    }

    Result after running:
    > dat
      x                                    some_other_cols    result
    1 this is my example text              1                  this text
    2 and here is my other text example    2                  and here example
    3 my other text is short               2                  is short
    4 yet another text                     4                  yet another text

    For more on removing parts of a string based on overlapping patterns, see the similar question on Stack Overflow: https://stackoverflow.com/questions/59945610/
