gpt4 book ai didi

r - 如何使 R 矩阵填充函数更快?

转载 作者:行者123 更新时间:2023-12-02 13:34:43 24 4
gpt4 key购买 nike

不久前,我编写了一个函数来填充时间序列矩阵,该矩阵的 NA 值根据所需的规范而增加,并且它偶尔会在一些大约 50000 行、350 列的矩阵上使用。矩阵可以包含数字或字符值。主要问题是修复矩阵很慢,我想我应该向一些专家咨询如何更快地完成此任务。

我想 rcpp 或并行它可能会有所帮助,但我认为这可能是我的设计而不是 R 本身效率低下。我通常对 R 中的所有内容进行 vecotrize,但由于缺失值不遵循任何模式,因此除了按行处理矩阵之外,我没有找到其他方法。

需要调用该函数,以便它可以继承缺失值,也可以调用该函数以快速用最后一个已知值填充最新值。

这是一个示例矩阵:

testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 
4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA,
NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA,
66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA,
-12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA,
NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA,
4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L,
9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e",
"ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31",
"2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31",
"2011-04-30", "2011-05-31")))

print(testMatrix)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA

这是我当前使用的功能:

# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax --- The max number of columns to carry a number
# forward if there are no more values in the
# time series row.
#
# Returns:
# A matrix with gaps filled.

GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {

if("DEBUG_ON" %in% ls(globalenv())){browser()}

cntRow <- nrow(inputMatrix)
cntCol <- ncol(inputMatrix)

#
if (forwardLooking) {
for (i in 1:cntRow) {
# Store the location of the first non NA element in the row
firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
if (!(is.na(firstValueCol))) {
if (!(firstValueCol == cntCol)) {
nextValueCol <- firstValueCol
# If there is a a value number in the row and it's not at the end of the time
# series, start iterating through the row while there are more NA values and
# more data values and not at the end of the row continue.
while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
# Find the next NA element
nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
# Find the next value element
nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
# If there is another value element then fill up all NA elements in between with the last known value
if (!is.na(nextValueCol)) {
inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
} else {
# If there is no other value element then fill up all NA elements up to the max number supplied
# with the last known value unless it's close to the end of the row then just fill up to the end.
inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
nextValueCol <- cntCol
}
}
}
}
}
} else {
for (i in 1:cntRow) {
if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
if (length(tempRow[!is.na(tempRow)])>0) {
lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
}
}
}
}

return(inputMatrix)
}

然后我用类似的方式调用它:

> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39

> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39

这个示例运行得很快,但是有什么方法可以让它对于大型矩阵更快吗?

> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
86.47 0.06 87.24

这个虚拟行有很多 NA 行和完全填充的行,但普通行可能需要大约 15-20 分钟。

更新

关于 Charles 对 na.locf 的评论并不完全反射(reflect)上述逻辑:下面是最终函数如何排除输入等检查的简化版本:

FillGaps <- function( dataMatrix, fillGapMax ) {

require("zoo")

numRow <- nrow(dataMatrix)
numCol <- ncol(dataMatrix)

iteration <- (numCol-fillGapMax)

if(length(iteration)>0) {
for (i in iteration:1) {
tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
}
}

return(dataMatrix)
}

最佳答案

我可能是错的,但我认为这是在 zoo 中实现的包:使用na.locf函数。

对于给定的示例矩阵,首先我们应该转置它,在调用 na 函数后,我们“重新转置”结果矩阵。例如:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39

并且使用较小的maxgap:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA

可以看到使用na.locf获得的性能:

>  system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
79.238 0.540 80.398
> system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12)))
user system elapsed
17.129 0.267 17.513

关于r - 如何使 R 矩阵填充函数更快?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/6370424/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com