r - Position of approximate substring matches in R

Reposted. Author: 行者123. Updated: 2023-12-05 02:21:53
I am doing string processing in R. I have a data frame with a column of strings, for example:

df <- data.frame(textcol=c("I would like to find the position of this substring",
"I would also like to find the position of thes substring",
"No match here","No mention of this substrangy thing"),
stringsAsFactors = FALSE) # keep textcol as character (the default only since R 4.0.0)

matchPattern <- "this substring"

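I am looking for a function (based on some distance measure, such as Jaro-Winkler) that takes my matchPattern, compares it against each row of the data frame's text column, and returns the exact position of the match within the matched string: 36 for the first element (unless I miscounted), (probably) 43 for the second, NA for the third, and 14 (?) for the fourth.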

Best answer

You can use aregexec:

## Get positions (-1 instead of NA)
positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)
unlist(positions)
# [1] 38 43 -1 15

## Extract matches
regmatches(df$textcol, positions)
# [[1]]
# [1] "this substring"
#
# [[2]]
# [1] "thes substring"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "this substrang"
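
The question asked for NA rather than -1 for non-matches, and the end of each match can be useful as well. The following is a minimal sketch of one way to get both, using only base R. It relies on the "match.length" attribute that aregexec attaches to each result. The names texts, start, len, and positions_df are illustrative, not part of the original answer. Note also that aregexec works with a generalized Levenshtein edit distance, not Jaro-Winkler.

```r
# Same data as in the question, kept as a plain character vector
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"

m <- aregexec(matchPattern, texts, max.distance = 0.1)

# Start of each match; aregexec reports -1 when nothing matched
start <- vapply(m, function(x) as.integer(x[1]), integer(1))
# Length of each matched span, taken from the "match.length" attribute
len <- vapply(m, function(x) as.integer(attr(x, "match.length")[1]), integer(1))

# Turn the -1 sentinel into NA, as requested in the question
start[start == -1L] <- NA_integer_
len[is.na(start)] <- NA_integer_

# One row per input string: start and end position of the approximate match
positions_df <- data.frame(start = start, end = start + len - 1L)
positions_df
```

The start column should agree with the positions printed above, with NA in place of -1 for the row that has no match.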

Edit

## A possibility for replacing matches, or maybe `regmatches<-`
res <- regmatches(df$textcol, positions)
res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
df$out <- Vectorize(gsub)(unlist(res), "Censored", df$textcol)
df$out
# [1] "I would like to find the position of Censored"
# [2] "I would also like to find the position of Censored"
# [3] "No match here"
# [4] "No mention of Censoredy thing"
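
The gsub approach re-searches each string for the matched text and interprets that text as a regular expression, so it could misbehave if a match contained regex metacharacters. A possible alternative, sketched here under the assumption that only the first approximate match per string needs replacing, is to splice the replacement in by position. The helper censor_at and the variable names are hypothetical and do not come from the original answer.

```r
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"

m <- aregexec(matchPattern, texts, max.distance = 0.1)
start <- vapply(m, function(x) as.integer(x[1]), integer(1))
len <- vapply(m, function(x) as.integer(attr(x, "match.length")[1]), integer(1))

# Hypothetical helper: replace the span [start, start + len - 1] in each string.
# Strings without a match (start == -1) are returned unchanged.
censor_at <- function(text, start, len, replacement = "Censored") {
  hit <- start > 0L
  out <- text
  out[hit] <- paste0(
    substr(text[hit], 1L, start[hit] - 1L),                    # text before the match
    replacement,
    substr(text[hit], start[hit] + len[hit], nchar(text[hit])) # text after the match
  )
  out
}

censored <- censor_at(texts, start, len)
censored
```

Because the replacement is driven by positions rather than by a second pattern search, no placeholder such as "XXXX" is needed for strings that have no match.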

Regarding r - Position of approximate substring matches in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31843171/
