r - 如何优化 R 中矩阵子部分的读写(可能使用 data.table)-6ren

r - 如何优化 R 中矩阵子部分的读写(可能使用 data.table)

转载作者：行者123 更新时间：2023-12-04 08:53:43

TL;博士

What is the fastest method in R for reading and writing a subset of columns from a very large matrix. I attempt a solution with data.table but need a fast way to extract a sequence of columns?

Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples.[for those applying to other problems be sure to read the disclaimers in the solutions!]. Scroll to the bottom of the question for some more lessons learned.

这是我的第一个 Stack Overflow 问题 - 我非常感谢您抽出时间来查看，如果我遗漏了任何内容，我深表歉意。我正在研究一个 R 包，我在子集和写入矩阵的部分时遇到了性能瓶颈(对于统计学家来说，应用程序在处理每个数据点后更新足够的统计数据)。单个操作的速度非常快，但它们的绝对数量要求它尽可能快。该想法的最简单版本是维度为 K 乘以 V 的矩阵，其中 K 通常介于 5 和 1000 之间，而 V 可以介于 1000 和 1,000,000 之间。

set.seed(94253)
K <- 100
V <- 100000
mat <-  matrix(runif(K*V),nrow=K,ncol=V)

然后我们最终对列的子集执行计算并将其添加到完整矩阵中。
因此天真地看起来像

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)

因为这样做了很多次，由于 R 的更改时复制语义，它可能会非常慢(但请参阅下面的经验教训，在某些情况下实际上可以进行修改)。

对于我的问题，对象不必是矩阵(并且我对此处概述的差异很敏感 Assign a matrix to a subset of a data.table )。我总是想要完整的列，因此数据框的列表结构很好。我的解决方案是使用 Matthew Dowle 很棒的 data.table 包。使用 set() 可以非常快速地完成写入。不幸的是，获得值(value)有点复杂。我们必须使用=FALSE 调用变量设置，这会大大减慢速度。

library(data.table)
DT <- as.data.table(mat)  
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))

在 set() 函数中，使用 i=NULL 来引用所有行的速度非常快，但是(可能是由于事物在后台存储的方式)j 没有可比的选项。 @Roland 在评论中指出，一种选择是转换为三重表示(行号、列号、值)并使用 data.tables 二进制搜索来加速检索。我手动测试了它，虽然它很快，但它确实使矩阵的内存需求大约增加了三倍。如果可能的话，我想避免这种情况。

跟随这里的问题: Time in getting single elemets from data.table and data.frame objects . Hadley Wickham 为单个索引提供了令人难以置信的快速解决方案

Vone <- Vsub[1]
toinsert.one <- toinsert[,1]
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))

但是，由于 .subset2(DT,i) 只是 DT[[i]] 没有方法分派(dispatch)，因此(据我所知)无法一次抓取几列，尽管它看起来确实应该是可能的。与上一个问题一样，看起来既然我们可以快速覆盖这些值，我们应该能够快速读取它们。

有什么建议么？另外请让我知道是否有比 data.table 更好的解决方案来解决这个问题。我意识到它在许多方面并不是真正的预期用例，但我试图避免将整个系列的操作移植到 C 中。

以下是讨论的元素的时序序列——前两个都是列，后两个只是一列。

microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
              set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
              mat[,Vone] <- mat[,Vone] + toinsert.one,
              set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),
              times=1000L)

Unit: microseconds
                  expr      min       lq   median       uq       max neval
                Matrix   51.970   53.895   61.754   77.313   135.698  1000
            Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
     Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
 Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000

答案和经验教训:

Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modify the matrix in place breaking R convention of not modifying the argument to a function but providing a much faster result.

Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio as Hadley notes in his book creates an additional reference for every object not within a function. Thus while in the performance case of a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.

最佳答案

Rcpp 的乐趣:

您可以使用Eigen's Map class就地修改 R 对象。

library(RcppEigen)
library(inline)

incl <- '
using  Eigen::Map;
using  Eigen::MatrixXd;
using  Eigen::VectorXi;
typedef  Map<MatrixXd>  MapMatd;
typedef  Map<VectorXi>  MapVeci;
'

body <- '
MapMatd              A(as<MapMatd>(AA));
const MapMatd        B(as<MapMatd>(BB));
const MapVeci        ix(as<MapVeci>(ind));
const int            mB(B.cols());
for (int i = 0; i < mB; ++i) 
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'

funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"), 
                       body, "RcppEigen", incl)

set.seed(94253)
K <- 100
V <- 100000
mat2 <-  mat <-  matrix(runif(K*V),nrow=K,ncol=V)

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert

invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE

library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
               funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
#                                  expr    min     lq  median      uq       max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
#         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100

我认为这基本上是@Joshua Ulrich 提出的。他关于打破 R 的功能范式的警告适用。

我在 C++ 中进行了加法，但是将函数更改为仅进行赋值是微不足道的。

显然，如果您可以在 Rcpp 中实现整个循环，您就可以避免在 R 级别重复调用函数并获得性能。

关于r - 如何优化 R 中矩阵子部分的读写(可能使用 data.table)，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20125947/

文章推荐： ruby-on-rails - Ruby on Rails : Nested named scopes

文章推荐： wordpress - 从页面编辑屏幕管理小部件

文章推荐： excel - 图片/表格工具不会出现在 Excel 的功能区上

文章推荐： apache-spark - 使用 pyspark 从 s3 读取/加载 avro 文件

typescript - A 部分部分 io-ts
我在使用 io-ts 时遇到一些问题。我发现它确实缺乏文档，我取得的大部分进展都是通过 GitHub issues 取得的。不，我不明白 HKT，所以没有帮助。基本上，我在其他地方创建一个类型，ty
java - 匹配完整文件正则表达式中的 A 部分，但不匹配 B 部分
我必须创建一个正则表达式来搜索整个文件，以找到与 Java XML 解析器的第一部分(但不是第二部分)的匹配项。这将用于防止某些 XXE 攻击。不幸的是，它确实必须是单个正则表达式，并且它确实需要搜索
c# - 部分/部分中的 asp.net mvs 部分？
我有一些简单的 Shared/_Header.cshtml 文件中的内容。 My Shared/_Layout.cshtml 通过调用插入该代码 @Html.Partial("_Header") 目前
java - Selenium 只执行循环的 if != null 部分，不运行循环的 "else if null "部分
我有一个 if-else 语句，其中: 条件 1:ID 匹配并且自动填充某些字段。然后 if 语句只填充其余字段条件 2:ID 不匹配，所有字段均为空白。 ELSE 语句将它们全部填充当我使条件
javascript - 无法在 JSFIDDLE 中使用滚动魔法(第 1 部分，共 2 部分)
我正在开发一个单页滚动网站。我正在尝试实现 ScrollMagic 并固定第一部分，以便网站的其余部分滚动到固定部分的顶部。我尝试创建一个 jsfiddle 来显示问题，但我似乎无法让 jsfiddl
javascript - 既然有

首页

博学

6Ren·AI

商城

r - 如何优化 R 中矩阵子部分的读写(可能使用 data.table)