r - Fastest way to conditionally replace values in a data.table (speed comparison)


Why does the second approach become slower as the data.table grows in size?

library(data.table)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

1:

DF1=DF2=DF

system.time(DF[y==6,"y"]<-10)
user system elapsed
2.793 0.699 3.497

2:

system.time(DF1$y[DF1$y==6]<-10)
user system elapsed
6.525 1.555 8.107

3:

system.time(DF2[y==6, y := 10]) # slowest!
user system elapsed
7.925 0.626 8.569

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

Is there a faster way to do this?

Best Answer

In your last case, this is a result of data.table's auto-indexing feature, available since v1.9.4+. Read on for the full picture :-).

When you do DT[col == .] or DT[col %in% .], an index is automatically generated the first time you run it. The index is simply the ordering vector of the column you specified, and it is computed very fast (using counting sort / true radix sort).

The table has 120 million rows, and computing the index takes roughly:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(data.table:::forderv(DF, "y"))
# 3.923 0.736 4.712

Side note: column y doesn't really need to be of type double (ordering on doubles takes longer). If we convert it to integer type:

DF[, y := as.integer(y)]
system.time(data.table:::forderv(DF, "y"))
# user system elapsed
# 0.569 0.140 0.717

The advantage is that any subsequent subset on that column using == or %in% will be very fast (see the slides, R script, and video of Matt's talk). For example:

# clean session, copy/paste code from above to create DF
system.time(DF[y==6, y := 10])
# user system elapsed
# 4.750 1.121 5.932

system.time(DF[y==6, y := 10])
# user system elapsed
# 4.002 0.907 4.969

Oh wait... that wasn't any faster. But... the index..?!? We are replacing the same column with new values each time, which changes that column's order (and therefore drops the index). Let's subset on y but modify v instead:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
# user system elapsed
# 4.653 1.071 5.765
system.time(DF[y==6, v := 10L])
# user system elapsed
# 0.685 0.213 0.910

options(datatable.verbose=TRUE)
system.time(DF[y==6, v := 10L])
# Using existing index 'y'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: v
# Assigning to 40000059 row subset of 120000000 rows
# user system elapsed
# 0.683 0.221 0.914

You can see that the subset using the index (the binary search, bmerge) completes in 0 seconds. Also check ?set2key().
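As a minimal sketch of this on a small toy table (note: indices() is the accessor for index names in more recent data.table versions than the one used in this answer), the first ==-subset creates the index as a side effect:

```r
library(data.table)

DT  <- data.table(y = c(1, 3, 6, 6, 3), v = 1:5)
res <- DT[y == 6]    # first run: auto-indexing builds an index on y
print(indices(DT))   # lists "y" once the auto index exists
print(res)
```

Subsequent subsets on y such as DT[y %in% c(3, 6)] then reuse this index via binary search instead of a vector scan.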

If you're not going to subset repeatedly, or if, as in your case, you subset and modify the same column, then it makes sense to disable the feature with options(datatable.auto.index = FALSE); filed as issue #1264:

# clean session
require(data.table)
options(datatable.auto.index = FALSE) # disable auto indexing
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
# user system elapsed
# 1.067 0.274 1.367
system.time(DF[y==6, v := 10L])
# user system elapsed
# 1.100 0.314 1.443

Not much difference here. A plain vector scan, system.time(DF$y == 6), takes 0.448s.

To summarize: in your case the vector scan makes more sense. But in general, the idea is that it is better to pay the penalty once and then get fast results on all future subsets of that column, rather than performing a vector scan every time.
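That "pay once" idea can also be made explicit, assuming a data.table version that provides setindex() (added after this answer was written):

```r
library(data.table)

set.seed(1L)
DT <- data.table(y = sample(c(1, 3, 6), 1e5, TRUE), v = 1L)

setindex(DT, y)        # pay the ordering cost once, up front
DT[y == 6, v := 10L]   # ==-subsets now reuse the index (binary search)
```

As discussed above, assigning to y itself would change its order and drop the index; modifying other columns such as v keeps it.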

The auto indexing feature is relatively new and will be extended, and probably optimised, over time (perhaps there are places we've not looked at). While answering this question, I realised that we don't show the time taken to compute the sort order (using fsort()), and I suspect the time spent there might be the reason the timings are quite close; filed as #1265.


As for why your second case is slow, it's not quite clear. I suspect it may be due to unnecessary copies being made on R's side. Which version of R are you using? In the future, please always post your sessionInfo() output.
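A sketch of how one might check for such copies in base R, using tracemem() (available when R is built with memory profiling, which standard CRAN binaries are); a plain data.frame is used here for illustration:

```r
x <- data.frame(y = c(1, 3, 6, 6))

if (isTRUE(capabilities("profmem"))) {
  invisible(tracemem(x$y))   # any tracemem output below signals a copy of the column
}
x$y[x$y == 6] <- 10          # the base-R style replacement from approach 2
```

If tracemem reports a duplication during the replacement, the column was copied rather than modified in place, which would explain part of the slowdown.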

Source: "r - Fastest way to conditionally replace values in a data.table (speed comparison)" on Stack Overflow: https://stackoverflow.com/questions/31989067/
