Slow aggregation in R data.table when using .SD

Reposted by: 行者123. Updated: 2023-12-03 23:36:03


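I am doing some aggregation with data.table (excellent package!!!) and I have found the .SD variable very useful for many things. However, using it slows the computation down significantly when there are many groups. An example follows: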

library(data.table)

# A moderately big data.table
x = data.table(id=sample(1e4,1e5,replace=TRUE),
               code=factor(sample(2,1e5,replace=TRUE)),
               z=runif(1e5)
              )

setkey(x,id,code)

system.time(x[,list(code2=nrow(.SD[code==2]), total=.N), by=id])
##  user  system elapsed 
##  6.226   0.000   6.242

system.time(x[,list(code2=sum(code==2), total=.N), by=id])
## user  system elapsed 
## 0.497   0.000   0.498

system.time(x[,list(code2=.SD[code==2,.N], total=.N), by=id])
## user  system elapsed 
## 6.152   0.000   6.168

Am I doing something wrong? Should I avoid .SD in favor of individual columns? Thanks in advance.

Best answer

Am I doing something wrong i.e. should I avoid .SD in favor of individual columns ?

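Yes, exactly. Only use .SD if you are really using all the data inside .SD. You might also find that the call to nrow() and the subquery call to [.data.table inside j are culprits too: use Rprof to confirm.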
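As a rough sketch of how Rprof could be used to check this, the snippet below profiles the slow form from the question on a smaller table. The table size, the seed, and the temporary output file are assumptions chosen for illustration, and the functions that rank highest will depend on the data.table version installed.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the run is repeatable

# Smaller stand-in for the question's table (sizes are illustrative)
x <- data.table(id   = sample(5e3, 5e4, replace = TRUE),
                code = factor(sample(2, 5e4, replace = TRUE)),
                z    = runif(5e4))
setkey(x, id, code)

prof_file <- tempfile(fileext = ".out")  # hypothetical output location

Rprof(prof_file)   # start the sampling profiler
res <- x[, list(code2 = nrow(.SD[code == 2]), total = .N), by = id]
Rprof(NULL)        # stop profiling

# Functions ranked by total time spent in them and their callees
head(summaryRprof(prof_file)$by.total)
```

If entries related to `[.data.table` or nrow appear near the top of the summary, that would support the suspicion that the per-group subquery is where the time goes.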

See the last few sentences of FAQ 2.1:

FAQ 2.1 How can I avoid writing a really long j expression? You've said I should use the column names, but I've got a lot of columns.
When grouping, the j expression can use column names as variables, as you know, but it can also use a reserved symbol .SD which refers to the Subset of the Data.table for each group (excluding the grouping columns). So to sum up all your columns it's just DT[,lapply(.SD,sum),by=grp]. It might seem tricky, but it's fast to write and fast to run. Notice you don't have to create an anonymous function. See the timing vignette and wiki for comparison to other methods. The .SD object is efficiently implemented internally and more efficient than passing an argument to a function. Please don't do this though : DT[,sum(.SD[,"sales",with=FALSE]),by=grp]. That works but is very inefficient and inelegant. This is what was intended: DT[,sum(sales),by=grp] and could be 100's of times faster.

See also the first bullet of FAQ 3.1:

FAQ 3.1 I have 20 columns and a large number of rows. Why is an expression of one column so quick?
Several reasons:
-- Only that column is grouped, the other 19 are ignored because data.table inspects the j expression and realises it doesn't use the other columns.

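When data.table inspects j and sees the .SD symbol, that efficiency gain disappears. It has to populate the whole .SD subset for each group, even if you do not use all of its columns. It is very difficult for data.table to know which columns of .SD you are really using (j could contain ifs, for example). However, if you need them all anyway, it does not matter of course, such as in DT[,lapply(.SD,sum),by=...]. That is the ideal use of .SD.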

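So, yes, avoid .SD wherever possible. Use column names directly to give data.table's optimization of j the best chance. The mere presence of the symbol .SD in j matters.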
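A small check, under assumed data, that the column-name form from the question produces the same counts as the .SD form, so switching costs nothing in correctness. The table size and seed are illustrative assumptions; no timings are claimed here.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed, only for repeatability

# Small stand-in for the question's table
x <- data.table(id   = sample(100, 1000, replace = TRUE),
                code = factor(sample(2, 1000, replace = TRUE)),
                z    = runif(1000))
setkey(x, id, code)

# Form that mentions .SD in j: .SD is populated for every group
via_sd  <- x[, list(code2 = .SD[code == 2, .N], total = .N), by = id]

# Form that uses the column name directly: only `code` is needed
via_col <- x[, list(code2 = sum(code == 2), total = .N), by = id]

# Compare the two results
same <- isTRUE(all.equal(via_sd, via_col))
print(same)
```

Wrapping each of the two expressions in system.time(), as the question does, is one way to compare their cost on your own data.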

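This is why .SDcols was introduced: so you can tell data.table which columns should be in .SD if you only want a subset. Otherwise, data.table will populate .SD with all the columns just in case j needs them.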
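A minimal sketch of .SDcols on a toy table. The names grp and sales echo the FAQ excerpt; the cost and note columns and all the values are hypothetical, made up for illustration.

```r
library(data.table)

# Toy table: two numeric columns to sum, one character column to leave out
DT <- data.table(grp   = c("a", "a", "b", "b"),
                 sales = c(10, 20, 30, 40),
                 cost  = c(1, 2, 3, 4),
                 note  = c("w", "x", "y", "z"))

# Restrict .SD to the columns that j actually needs
res <- DT[, lapply(.SD, sum), by = grp, .SDcols = c("sales", "cost")]
print(res)
```

Here .SD holds only sales and cost for each group, so the note column is never passed to sum().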

For slow aggregation in R data.table when using .SD, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15273491/
