
r - Fastest way to split strings into fixed-length elements in R

Reposted. Author: 行者123. Updated: 2023-12-05 09:21:27

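How to split a string into elements of fixed length in R is a commonly asked question. Typical answers rely either on substring(x) or on strsplit(x, split = "") followed by paste(y, collapse = ""). For instance, specifying a fixed length of 3 characters slices the string "azertyuiop" into "aze", "rty", "uio", "p".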
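As a minimal sketch of the two typical approaches mentioned above (the variable names and the grouping via split() are illustrative assumptions, not code from the original answers):

```r
x <- "azertyuiop"
size <- 3

# Approach 1: substring() with vectors of start and end positions
starts <- seq(1, nchar(x), by = size)
by_substring <- substring(x, starts, pmin(starts + size - 1, nchar(x)))

# Approach 2: split into single characters, group them, paste each group back
chars <- strsplit(x, "")[[1]]
groups <- ceiling(seq_along(chars) / size)  # 1 1 1 2 2 2 3 3 3 4
by_paste <- unname(vapply(split(chars, groups), paste, character(1), collapse = ""))

print(by_substring)  # "aze" "rty" "uio" "p"
print(by_paste)      # "aze" "rty" "uio" "p"
```

Both give the same pieces on this short input; the question is about how they scale to long strings.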

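I am looking for the fastest way to do this. After some tests on long strings (> 1000 characters), I found that substring() is far too slow. The strategy is therefore to split the string into individual characters, then paste them back into groups of the desired length by applying some trick.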

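Here is the fastest function I could come up with. The idea is to split the string into individual characters, insert a separator at the right positions in the character vector, collapse the characters (and separators) back into a single string, and then split that string again, this time on the separator.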

splitInParts <- function(string, size) {  # can process a vector of strings; "size" is the length of the desired substrings
  chars <- strsplit(string, "", fixed = TRUE)
  lengths <- nchar(string)
  nFullGroups <- floor(lengths / size)  # the number of complete substrings of the desired size

  # Prepare a vector of separators (commas), most of which will be replaced by the characters, except at the positions that must separate substring groups of length "size". Assumes that the string doesn't contain any commas.
  seps <- Map(rep, ",", lengths + nFullGroups)  # the seps vector is longer than the chars vector, because it holds separators (as many as there are full groups)
  indices <- Map(seq, 1, lengths + nFullGroups)  # the positions at which separators will be replaced by the characters
  indices <- lapply(indices, function(x) which(x %% (size + 1) != 0))  # excludes the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)

  temp <- function(x, y, z) {  # a function describing the replacement, because we call it in the Map() call below
    x[y] <- z
    x
  }
  res <- Map(temp, seps, indices, chars)  # now we have a vector of chars with separators interspersed
  res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE)  # collapses the characters and separators
  strsplit(res, ",", fixed = TRUE)  # and at last, split the strings into elements of the desired length (returned visibly)
}
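To make the separator trick easier to follow, here is a hedged walk-through of the same steps on a single short string. The intermediate variable names (n, pos, keep, joined, parts) are introduced only for illustration.

```r
string <- "azertyuiop"
size <- 3

chars <- strsplit(string, "", fixed = TRUE)[[1]]
n <- length(chars)                    # 10 characters
nFullGroups <- n %/% size             # 3 complete groups of 3

seps <- rep(",", n + nFullGroups)     # 13 commas to start with
pos <- seq_len(n + nFullGroups)
keep <- pos[pos %% (size + 1) != 0]   # 1 2 3 5 6 7 9 10 11 13
seps[keep] <- chars                   # commas survive only at positions 4, 8, 12

joined <- paste(seps, collapse = "")  # "aze,rty,uio,p"
parts <- strsplit(joined, ",", fixed = TRUE)[[1]]
print(parts)                          # "aze" "rty" "uio" "p"
```

Every (size + 1)-th slot keeps its comma, so a single vectorised assignment places all characters at once and no per-group loop is needed.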

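This looks quite tedious, but I tried simply putting the chars vector into a matrix with an adequate number of rows and then collapsing its columns with apply(mat, 2, paste, collapse = ""). That is much slower. And using split() to turn the character vector into a list of vectors of the right length, so as to collapse their elements, is slower still.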
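For reference, the matrix alternative could look roughly like the sketch below. The padding with empty strings is an assumption made here so that the last, shorter group fits into the matrix; the question does not show how that case was handled.

```r
string <- "azertyuiop"
size <- 3

chars <- strsplit(string, "", fixed = TRUE)[[1]]
nGroups <- ceiling(length(chars) / size)

# Pad with empty strings so the characters fill a size x nGroups matrix (assumed detail)
padded <- c(chars, rep("", nGroups * size - length(chars)))
mat <- matrix(padded, nrow = size)  # filled column by column: one group per column

by_matrix <- apply(mat, 2, paste, collapse = "")
print(by_matrix)                    # "aze" "rty" "uio" "p"
```

apply() calls paste() once per column, which is one plausible reason this variant falls behind on long strings.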

So if you can find something faster, please let me know. If not, my function may be of some use. :)

Best answer

The updates were fun to read, so I ran a benchmark:

> nchar(mystring)
[1] 260000

My idea is almost the same as @akrun's (since str_extract_all uses the same function under the hood, IIRC):

library(stringr)
tensiSplit <- function(string, size) {
  str_extract_all(string, paste0('.{1,', size, '}'))
}
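To show what the pattern does without requiring stringr, here is a small base R sketch. Note that regmatches() with gregexpr() is swapped in as a stand-in for str_extract_all(); it is not what the answer benchmarks.

```r
string <- "azertyuiop"
size <- 3

# Build the pattern the same way as in the answer: ".{1,3}" for size 3
pattern <- paste0(".{1,", size, "}")

# Base R stand-in for str_extract_all(): locate all matches, then extract them
chunks <- regmatches(string, gregexpr(pattern, string))

print(pattern)      # ".{1,3}"
print(chunks[[1]])  # "aze" "rty" "uio" "p"
```

The quantifier {1,size} is greedy, so each match takes size characters whenever they are available, and only the final match can be shorter.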

And the results on my machine:

> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
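The construction of mystring is not shown here, so the sketch below assumes one possible input of the same length (the alphabet repeated 10,000 times). It times a hypothetical candidate, sepSplit, with base R's system.time() instead of microbenchmark. sepSplit is made up for this illustration and is not the gsubSplit from the table, whose definition does not appear in this post.

```r
# Assumed test input: 26 letters * 10000 repetitions = 260000 characters
mystring <- paste(rep(letters, 10000), collapse = "")
print(nchar(mystring))  # 260000

# Hypothetical candidate: insert a comma after every full group, then split on it.
# Like splitInParts, it assumes the input contains no commas.
sepSplit <- function(string, size) {
  marked <- gsub(paste0("(.{", size, "})"), "\\1,", string)
  strsplit(marked, ",", fixed = TRUE)
}

timing <- system.time(res <- sepSplit(mystring, 3))
print(timing[["elapsed"]])  # seconds; varies by machine
print(length(res[[1]]))     # 86667 pieces, the last one 2 characters long
```

A single system.time() call is noisy; repeating the call several times, as microbenchmark does with times = 3, gives more stable figures.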

Regarding "r - Fastest way to split strings into fixed-length elements in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32398301/
