
r - Optimizing a count of the unique values of one variable with data.table

Reposted. Author: 行者123. Updated: 2023-12-03 17:12:14

I am trying to find the number of unique values of a variable x within each group defined by a variable/key y.

I have been using the following code:

 DT[,length(unique(x)),by=y] -> x_count_per_y

This works, but it is somewhat slow. Is there a way to optimize this with data.table, or is this the fastest I can expect?
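To make the target concrete, here is a minimal sketch of the same expression on a toy table. The table contents are made up for illustration and are not from the question; only the grouping expression is the asker's.

```r
library(data.table)

# Toy data (hypothetical): two groups in y, with repeated values in x
DT <- data.table(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "a", "b", "b", "b")
)

# Count the distinct x values within each group of y.
# The unnamed result column is called V1 by data.table.
x_count_per_y <- DT[, length(unique(x)), by = y]
print(x_count_per_y)
```

Group "a" holds the values 1, 1, 2 and group "b" holds 3, 3, 3, so the counts are 2 and 1 respectively.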

Best answer

Use uniqueN, available in data.table version 1.9.5.
In 1.9.4 the following definition should work as well:

uniqueN <- function(x) length(attr(data.table:::forderv(x, retGrp=TRUE),"starts",TRUE))

To use it programmatically:

byvar = "y"
countvar = "x"
DT[, uniqueN(.SD), by=byvar, .SDcols=countvar]
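As a small check that the direct and programmatic forms agree, the sketch below runs both on a toy table. It assumes a data.table release that exports uniqueN (so the fallback definition above is not needed); the table contents and the result column name n_unique are illustrative choices, not part of the original answer.

```r
library(data.table)

# Toy data (hypothetical): group "a" has 2 distinct x values, group "b" has 1
DT <- data.table(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "a", "b", "b", "b")
)

# Column names held in character variables, as in the answer
byvar <- "y"
countvar <- "x"

# Direct form: column names written out in the call
direct <- DT[, .(n_unique = uniqueN(x)), by = y]

# Programmatic form: .SD is restricted to `countvar` via .SDcols,
# so uniqueN(.SD) counts the unique rows of that single column
programmatic <- DT[, .(n_unique = uniqueN(.SD)), by = byvar, .SDcols = countvar]

print(direct)
print(programmatic)
```

With a single column in .SDcols, counting unique rows of .SD is the same as counting unique values of that column, which is why the two results match.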

Timings are as follows:

library(data.table)
library(microbenchmark)
# 1e6 rows; x is drawn from 1e5 possible values, y from 100 groups
N <- 1e6
DT <- data.table(x = sample(1e5, N, TRUE), y = sample(1e2, N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   85.58602   85.58602   85.58602   85.58602   85.58602   85.58602   1
# DT[, uniqueN(x), y]                          92.71877   92.71877   92.71877   92.71877   92.71877   92.71877   1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  97.51024   97.51024   97.51024   97.51024   97.51024   97.51024   1
# 1e7 rows; x is drawn from 1e5 possible values, y from 100 groups
N <- 1e7
DT <- data.table(x = sample(1e5, N, TRUE), y = sample(1e2, N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   1642.5212  1642.5212  1642.5212  1642.5212  1642.5212  1642.5212  1
# DT[, uniqueN(x), y]                          843.0670   843.0670   843.0670   843.0670   843.0670   843.0670   1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  804.7881   804.7881   804.7881   804.7881   804.7881   804.7881   1
# 1e7 rows; x is drawn from 1e6 possible values, y from 1e5 groups
N <- 1e7
DT <- data.table(x = sample(1e6, N, TRUE), y = sample(1e5, N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: seconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   3.025365   3.025365   3.025365   3.025365   3.025365   3.025365   1
# DT[, uniqueN(x), y]                          4.734323   4.734323   4.734323   4.734323   4.734323   4.734323   1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  5.905721   5.905721   5.905721   5.905721   5.905721   5.905721   1
# 1e7 rows; x is drawn from 1e3 possible values, y from 1e5 groups
N <- 1e7
DT <- data.table(x = sample(1e3, N, TRUE), y = sample(1e5, N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: seconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   2.906589   2.906589   2.906589   2.906589   2.906589   2.906589   1
# DT[, uniqueN(x), y]                          4.731925   4.731925   4.731925   4.731925   4.731925   4.731925   1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  7.084020   7.084020   7.084020   7.084020   7.084020   7.084020   1
# 1e7 rows; x is drawn from 1e6 possible values, y from 100 groups
N <- 1e7
DT <- data.table(x = sample(1e6, N, TRUE), y = sample(1e2, N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   1331.244   1331.244   1331.244   1331.244   1331.244   1331.244   1
# DT[, uniqueN(x), y]                          998.040    998.040    998.040    998.040    998.040    998.040    1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  1096.867   1096.867   1096.867   1096.867   1096.867   1096.867   1

Much depends on the data, but I have filed an issue to look into these timings. Here is one more benchmark, on a character column:

# 1e7 rows of character data; x takes 26 possible values, y defines 10 groups
N <- 1e7
DT <- data.table(x = sample(letters, N, TRUE), y = sample(letters[1:10], N, TRUE))
microbenchmark(times = 1L,
  DT[, length(unique(x)), y],
  DT[, uniqueN(x), y],
  DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr                                         min        lq         mean       median     uq         max        neval
# DT[, length(unique(x)), y]                   1304.4865  1304.4865  1304.4865  1304.4865  1304.4865  1304.4865  1
# DT[, uniqueN(x), y]                          573.8628   573.8628   573.8628   573.8628   573.8628   573.8628   1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  528.3269   528.3269   528.3269   528.3269   528.3269   528.3269   1
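Since the ranking above flips with the number of groups and each run was timed only once, it can help to rerun the comparison on data shaped like your own. The following is a rough sketch, not the answer's method: time_variants is a hypothetical helper, it uses base R system.time instead of microbenchmark, and the sizes are kept small so it finishes quickly. It again assumes a data.table release that exports uniqueN.

```r
library(data.table)

# Hypothetical helper: median elapsed seconds over `reps` runs of each variant,
# for N rows, `nx` possible values of x and `ny` groups in y.
time_variants <- function(N, nx, ny, reps = 3L) {
  DT <- data.table(x = sample(nx, N, TRUE), y = sample(ny, N, TRUE))
  # Run a zero-argument function `reps` times and take the median elapsed time
  elapsed <- function(f) {
    median(vapply(seq_len(reps),
                  function(i) system.time(f())[["elapsed"]],
                  numeric(1)))
  }
  c(
    length_unique = elapsed(function() DT[, length(unique(x)), by = y]),
    uniqueN       = elapsed(function() DT[, uniqueN(x), by = y])
  )
}

set.seed(1)
few_groups  <- time_variants(N = 1e5, nx = 1e3, ny = 1e2)  # 100 groups
many_groups <- time_variants(N = 1e5, nx = 1e3, ny = 1e4)  # 10,000 groups
print(rbind(few_groups, many_groups))
```

The absolute numbers depend on the machine and the data.table version, so treat the output as a comparison between the two rows and columns rather than as a reproduction of the timings above.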

For more on r - optimizing a count of the unique values of one variable with data.table, see the similar question on Stack Overflow: https://stackoverflow.com/questions/29684036/
