
r - Optimizing a "set in a string list" to a "set as a matrix" operation

Reposted. Author: 行者123. Last updated: 2023-12-03 16:06:42

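I have a set of strings that contain space-separated elements. I want to build a matrix that tells me which elements are part of which strings. For example: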

""
"A B C"
"D"
"B D"

should give something like:
  A B C D
1
2 1 1 1
3       1
4   1   1

I do have a solution, but it runs as slow as molasses, and I have run out of ideas for making it faster:
reverseIn <- function(vector, value) {
  return(value %in% vector)
}

buildCategoryMatrix <- function(valueVector) {
  allClasses <- c()
  for(classVec in unique(valueVector)) {
    allClasses <- unique(c(allClasses,
                           strsplit(classVec, " ", fixed=TRUE)[[1]]))
  }

  resMatrix <- matrix(ncol=0, nrow=length(valueVector))
  splitValues <- strsplit(valueVector, " ", fixed=TRUE)

  for(cat in allClasses) {
    if(cat=="") {
      catIsPart <- (valueVector == "")
    } else {
      catIsPart <- sapply(splitValues, reverseIn, cat)
    }
    resMatrix <- cbind(resMatrix, catIsPart)
  }
  colnames(resMatrix) <- allClasses

  return(resMatrix)
}

Profiling the function gives me this:
$by.self
         self.time self.pct total.time total.pct
"match"      31.20    34.74      31.24     34.79
"FUN"        30.26    33.70      74.30     82.74
"lapply"     13.56    15.10      87.86     97.84
"%in%"       12.92    14.39      44.10     49.11

So my actual questions are:
- Where does the 33% spent in "FUN" come from?
- Is there any way to speed up the %in% call?

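I tried converting the strings to factors before entering the loop, so that I would be matching numbers instead of strings, but that actually makes R crash. I also tried partial matrix assignment (i.e., resMatrix[i,x] <- 1), where i is the number of the string and x is the vector of factors. No dice there either, as it seems to run indefinitely.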

Best answer

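The development version of my "splitstackshape" package has a helper function called charBinaryMat that can be used for something like this.

Here's the function (since it is not yet in the version of the package on CRAN):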

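charBinaryMat <- function(listOfValues, fill = NA) {
  # One column per distinct element, in sorted order
  lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
  colnames(m) <- lev
  # seq_len() keeps the loop from running when the input list is empty
  for (i in seq_len(nrow(m))) {
    # Index the columns by name: mark every element present in string i
    m[i, listOfValues[[i]]] <- 1
  }
  m
}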

The input should be the output of strsplit. Here it is in use:
str <- c("" , "A B C" , "D" , "B D" )

## Fill is `NA` by default
charBinaryMat(strsplit(str, " ", fixed=TRUE))
#       A  B  C  D
# [1,] NA NA NA NA
# [2,]  1  1  1 NA
# [3,] NA NA NA  1
# [4,] NA  1 NA  1

## Can easily be set to another value
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
#      A B C D
# [1,] 0 0 0 0
# [2,] 1 1 1 0
# [3,] 0 0 0 1
# [4,] 0 1 0 1
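
The question mentions trying an assignment of the form resMatrix[i,x] <- 1. As a rough sketch of that idea (not part of splitstackshape; the name charBinaryMatIdx is made up for illustration), the row loop in charBinaryMat could be replaced by a single assignment through a two-column index matrix of (row, column) pairs. This assumes R 3.2.0 or later for lengths(), and has not been benchmarked here:

```r
# Hypothetical loop-free variant of charBinaryMat (illustrative name).
charBinaryMatIdx <- function(listOfValues, fill = NA) {
  vals <- unlist(listOfValues, use.names = FALSE)
  lev <- sort(unique(vals))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev),
              dimnames = list(NULL, lev))
  # Row number of every element, repeated once per element of each string
  rows <- rep(seq_along(listOfValues), lengths(listOfValues))
  # Column number of every element, looked up once for the whole input
  cols <- match(vals, lev)
  # A two-column matrix index sets all (row, col) cells in one step
  m[cbind(rows, cols)] <- 1
  m
}

x <- c("", "A B C", "D", "B D")
res <- charBinaryMatIdx(strsplit(x, " ", fixed = TRUE), fill = 0)
```

For the four example strings, res should hold the same values as the fill = 0 output shown above.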

Benchmarking

Since your question is about a faster approach, let's benchmark.
  • Functions to benchmark:
    CBM <- function() {
      charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
    }
    BCM <- function() {
      buildCategoryMatrix(str)*1L
    }
    Sapply <- function() {
      y <- unique( unlist( strsplit( str , " " ) ) )
      out <- t(sapply(str, function(x) y %in% unlist(strsplit(x , " " )),
                      USE.NAMES = FALSE )) * 1L
      colnames(out) <- y
      out
    }
  • Some sample data:
    set.seed(1)
    A = sample(10, 100000, replace = TRUE)
    str <- sapply(seq_along(A), function(x)
      paste(sample(LETTERS[1:10], A[x]), collapse = " "))
    head(str)
    # [1] "H G C"               "F H J G"             "H D J A I B"
    # [4] "A C F H J B E G D I" "F C H"               "I C G B J D F A E"
  • Some sample output:
    ## Automatically sorted
    head(CBM())
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1

    ## Sorting just for comparison
    head(BCM())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1

    ## Sorting just for comparison
    head(Sapply())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
  • Benchmark:
    library(microbenchmark)
    microbenchmark(CBM(), BCM(), Sapply(), times=20)
    # Unit: milliseconds
    #      expr        min         lq     median         uq        max neval
    #     CBM()   675.0929   718.3454   777.2423   805.3872   858.6609    20
    #     BCM() 11059.6305 11267.9888 11367.3283 11595.1758 11792.5950    20
    #  Sapply()  3536.7755  3687.0308  3759.7388  3813.4233  3968.3192    20
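
The sample outputs in the benchmark are compared by eye on the first six rows only. A minimal sketch of a programmatic check on a small input might look like the following. It repeats charBinaryMat so the snippet runs on its own, and sapplyMat is a hypothetical rewrite of the Sapply() benchmark function that takes its input as an argument rather than reading the global str:

```r
# charBinaryMat from the answer, repeated so this snippet is self-contained
charBinaryMat <- function(listOfValues, fill = NA) {
  lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
  colnames(m) <- lev
  for (i in seq_len(nrow(m))) {
    m[i, listOfValues[[i]]] <- 1
  }
  m
}

# Hypothetical argument-taking version of the Sapply() benchmark function
sapplyMat <- function(x) {
  y <- unique(unlist(strsplit(x, " ", fixed = TRUE)))
  out <- t(sapply(x, function(s) y %in% unlist(strsplit(s, " ", fixed = TRUE)),
                  USE.NAMES = FALSE)) * 1L
  colnames(out) <- y
  out
}

small <- c("", "A B C", "D", "B D")
a <- charBinaryMat(strsplit(small, " ", fixed = TRUE), fill = 0)
# Reorder the columns of the second result to match the first
b <- sapplyMat(small)[, colnames(a)]
# Compare values only; the storage types differ (double vs integer)
agree <- isTRUE(all.equal(a, b, check.attributes = FALSE))
```

Here agree should be TRUE when the two approaches produce the same indicator matrix. The same comparison could be run on the full benchmark data before timing.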
This post, r - Optimizing a "set in a string list" to a "set as a matrix" operation, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/19594253/
