python - Get frequently occurring string patterns from a column in R or Python

Reposted. Author: 太空狗. Updated: 2023-10-30 01:09:05

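I have a column of string names, and I would like to find patterns (words) that occur often. Is there a way to return the strings of length greater than (or equal to) X that occur more than Y times across the whole column?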

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....)
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
> what times
[1] bla 5
[2] okay 5

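Regarding Tim's comment:

I would like to remove nested patterns, so if "aaaaaaa" and "aaaa" would both appear in the output, only "aaaaaaa" and its number of occurrences should be reported.

With atleaststringsize=3 or atleaststringsize=4, the output would be the same. Suppose atleasttimes=10, "aaaaaaaa" occurs 15 times, and "aaaaaa" occurs 15 times, then: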

getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=10)
> what times
[1] aaaaaaaa 15

getOftenOccuringPatterns(column, atleaststringsize=4, atleasttimes=10)
> what times
[1] aaaaaaaa 15

Only the longest one stays, and the result is the same for atleaststringsize=3 and atleaststringsize=4.
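
The nesting rule described above can be sketched in Python, the other language the question allows. This is only an illustration under one assumption: a pattern is dropped whenever it is contained in a different pattern that also passed the count threshold, regardless of the two counts. The function name `drop_nested` and its input format are hypothetical and do not come from the question.

```python
def drop_nested(counts):
    """Keep only patterns that are not contained in another kept pattern.

    `counts` maps each pattern that already passed the frequency
    threshold to its number of occurrences (hypothetical input format).
    """
    return {
        pattern: times
        for pattern, times in counts.items()
        # `in` on strings is a literal substring test, not a regex match
        if not any(pattern != other and pattern in other for other in counts)
    }


# Counts taken from the example in the question
counts = {"aaaaaaaa": 15, "aaaaaa": 15}
print(drop_nested(counts))
```

Because "aaaaaa" is contained in "aaaaaaaa", only the longer pattern survives, whichever minimum length was requested.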

Best answer

It has not been tested in any way and will not win any speed races:

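getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){

  # Collect every substring of length >= atleaststringsize from each string
  res <- lapply(column, function(x){
    n <- nchar(x)
    # Strings that are too short would make atleaststringsize:n count downwards
    if(n < atleaststringsize) return(character(0))
    lapply(atleaststringsize:n, function(y){
      subs <- substring(x, 1:(n - y + 1), y:n)
      # Optionally count each substring at most once per string
      if(uniqueInColumns) unique(subs) else subs
    })
  })

  # Sort so that equal substrings are adjacent, then count the runs
  encodedRes <- rle(sort(unlist(res)))

  # Keep the substrings that occur at least atleasttimes times
  check <- encodedRes$lengths >= atleasttimes
  partRes <- list(what = encodedRes$values[check], times = encodedRes$lengths[check])

  # Flag substrings nested in another kept substring
  # (fixed = TRUE matches literally instead of treating x as a regex)
  testRes <- vapply(partRes$what, function(x){
    length(grep(x, partRes$what, fixed = TRUE)) > 1
  }, logical(1))

  lapply(partRes, '[', !testRes)

}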


column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what
[1] "bla"  "okay"

$times
[1] 5 5


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"

$times
[1] 6


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumns = TRUE)
$what
[1] "aaaaa"

$times
[1] 4

Regarding "python - Get frequently occurring string patterns from a column in R or Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16757306/
