
r - Extracting different words from a string in R

Reposted. Author: 行者123. Updated: 2023-12-04 11:57:11

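I've seen several SO posts that seem to come close to answering this question, but I can't tell whether any of them actually does, so forgive me if this is a duplicate. I have a few dozen strings (a column in a data frame) that contain different numbers, usually written as words but sometimes as integers. For example:
Three neonates with one adult
1 adult, ten neonates nearby
Two adults and six neonates
My end goal is to extract the number of neonates and adults from each string and get something like this:
data.frame(Adults=c(1,1,2), Neonates=c(3,10,6))
But the number and position of the numbers vary from string to string. All the examples I have seen that use gsub, strsplit, etc. only seem to work when the pattern used for replacing, splitting, or extracting is the same in every string or stays at a constant position within the string. Since I know the numbers must be in c("one","two",...,"ten"), I could loop through each string and then through each possible number to see whether it is present, and if so, extract it and convert it to numeric. But that seems very inefficient.

Any help is much appreciated!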

Best answer

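One potential approach uses str_split from the stringr package together with a custom function that wraps the match lookup and the post-processing. The dataset size was not mentioned, so I cannot test or comment on speed.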

library(stringr) # for str_split

customFun = function(
  strObj = "Three neonates with one adult",
  rootOne = "adult",
  rootTwo = "neonate") {

  # split the string into words on whitespace
  discreteStr = str_split(strObj, pattern = "\\s+", simplify = TRUE)

  # find the indices of the words containing each root
  rootOneIndex = grep(rootOne, discreteStr)
  rootTwoIndex = grep(rootTwo, discreteStr)

  # mapping vector: number words -> digit strings
  charVec = c("one","two","three","four","five","six","seven","eight","nine","ten")
  numVec = as.character(1:10)
  names(numVec) = charVec

  # take the neighbours (-1/+1) of each root word and select the first one
  rootOneMatches = tolower(discreteStr[c(rootOneIndex-1, rootOneIndex+1)])
  rootOneMatches = rootOneMatches[!is.na(rootOneMatches)]
  rootOneMatches = head(rootOneMatches, 1)

  rootTwoMatches = tolower(discreteStr[c(rootTwoIndex-1, rootTwoIndex+1)])
  rootTwoMatches = rootTwoMatches[!is.na(rootTwoMatches)]
  rootTwoMatches = head(rootTwoMatches, 1)

  # keep the neighbour only if it is a known number word or digit string
  rootOneNum = intersect(rootOneMatches, c(charVec, numVec))
  rootTwoNum = intersect(rootTwoMatches, c(charVec, numVec))

  # convert to numeric: digit strings directly, number words via numVec
  # (as.numeric() on a number word warns, hence suppressWarnings() in the caller)
  rootOneFinal = ifelse(!is.na(as.numeric(rootOneNum)), as.numeric(rootOneNum), as.numeric(numVec[rootOneNum]))
  rootTwoFinal = ifelse(!is.na(as.numeric(rootTwoNum)), as.numeric(rootTwoNum), as.numeric(numVec[rootTwoNum]))

  # fall back to NA when no count was found, so data.frame() still gets one row
  if (length(rootOneFinal) == 0) rootOneFinal = NA_real_
  if (length(rootTwoFinal) == 0) rootTwoFinal = NA_real_

  outDF = data.frame(strObj = strObj, adults = rootOneFinal, neonates = rootTwoFinal, stringsAsFactors = FALSE)
  return(outDF)
}

Output:
inputVec = c("Three neonates with one adult","1 adult, ten neonates nearby","Two adults and six neonates")
outputAggDF = suppressWarnings(do.call(rbind,lapply(inputVec,customFun)))

outputAggDF
#                         strObj adults neonates
#1 Three neonates with one adult      1        3
#2  1 adult, ten neonates nearby      1       10
#3   Two adults and six neonates      2        6
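
As a possible alternative to splitting each string and inspecting neighbours, the same lookup can be sketched with a single regular expression in base R. This is only a sketch and not part of the original answer: the names `numberWords`, `lookup`, and `extractCount` are hypothetical, and it assumes the count always appears directly before the root word (which holds for the three example strings, but is narrower than the -1/+1 neighbourhood used above).

```r
# Hypothetical lookup table: number words and digit strings 1-10 -> integers
numberWords = c("one","two","three","four","five","six","seven","eight","nine","ten")
lookup = c(setNames(1:10, numberWords), setNames(1:10, as.character(1:10)))

# Hypothetical helper: return the count written directly before `root`
# in each element of x, or NA when no such count is found.
extractCount = function(x, root) {
  x = tolower(x)
  # e.g. \b(one|two|...|10)\s+adult ; the capture group holds the count token
  alternatives = paste(names(lookup), collapse = "|")
  pattern = paste0("\\b(", alternatives, ")\\s+", root)
  # regexec() keeps one list element per input string, even without a match
  hits = regmatches(x, regexec(pattern, x, perl = TRUE))
  tokens = vapply(
    hits,
    function(h) if (length(h) >= 2) h[2] else NA_character_,
    character(1)
  )
  # translate tokens to integers; NA tokens stay NA
  unname(lookup[tokens])
}

inputVec = c("Three neonates with one adult",
             "1 adult, ten neonates nearby",
             "Two adults and six neonates")

result = data.frame(
  Adults = extractCount(inputVec, "adult"),
  Neonates = extractCount(inputVec, "neonate")
)
```

Because `extractCount` is vectorised over `x`, it can be applied to a whole data frame column at once, and it needs no `suppressWarnings()` since number words are never passed to `as.numeric()`. Strings where the count follows the root word, or is separated from it by other words, would come back as NA under this assumption.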

Regarding "r - Extracting different words from a string in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45720841/
