R: Why is dplyr (n_distinct) faster than data.table (uniqueN) at counting unique values by group?

As far as I understand, data.table is more efficient and faster than dplyr, but today at work I found the opposite. I created a simulation to illustrate the situation.

library(data.table)
library(dplyr)
library(microbenchmark)

# data simulated
dt = data.table(A = sample(1:4247, 10000, replace = T),
                B = sample(1:119,  10000, replace = T),
                C = sample(1:6,    10000, replace = T),
                D = sample(1:30,   10000, replace = T))

dt[, ID := paste(A, ":::", D, ":::", C)]
# execution time
microbenchmark(
  DATA_TABLE = dt[, .(count = uniqueN(ID)), by = c("A", "B", "C")],

  DPLYR = dt %>%
    group_by(A, B, C) %>%
    summarise(count = n_distinct(ID)),

  times = 10
)

Result:
Unit: milliseconds
       expr         min          lq        mean      median          uq         max neval
 DATA_TABLE 14241.57361 14305.67026 15585.80472 14651.16402 16244.22477 21367.56866    10
      DPLYR    35.95123    37.63894    47.62637    48.56598    53.59919    62.63978    10

You can see there is a huge difference! Does anyone know the reason? Do you have any advice on when to use dplyr versus data.table?

I already have the full code written in data.table syntax, and because of this case I am not sure whether I need to convert some of the code blocks to dplyr.

Thanks in advance.

Best Answer

Here is another option, which works on the data sorted by the grouping columns and derives the per-group counts from run-length ids at the group boundaries:

cols <- c("A", "B", "C")
dt[order(A, B, C), {
  # run-length id over (A, B, C, ID) on the sorted data
  uniqn <- rleidv(c(.SD, .(ID)))
  # last row index of each (A, B, C) group
  lastidx <- c(which(diff(rowidv(.SD)) < 1L), .N)
  # per-group count = differences of the run-length id at the group boundaries
  c(.SD[lastidx], .(count = c(uniqn[lastidx[1L]], diff(uniqn[lastidx]))))
}, .SDcols = cols]
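As a quick illustration of the two building blocks used above (rleidv assigns a run-length id over rows, rowidv numbers rows within groups of identical values), here is a toy example on made-up data; the small table x below is not part of the original question:

library(data.table)

# toy data, already sorted by the grouping column A
x <- data.table(A  = c(1, 1, 1, 2, 2),
                ID = c("a", "a", "b", "c", "c"))

rleidv(x)            # run-length id over (A, ID): 1 1 2 3 3
rowidv(x[, .(A)])    # row number within each value of A: 1 2 3 1 2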

Timing code:
cols <- c("A", "B", "C")
microbenchmark(times = 1L,

  DATA_TABLE = a00 <- dt[, .(count = uniqueN(ID)), cols],

  DATA_TABLE1 = a01 <- dt[, .(count = length(unique(ID))), cols],

  DPLYR = a_dplyr <- dt %>%
    group_by(A, B, C) %>%
    summarise(count = n_distinct(ID)),

  # https://github.com/Rdatatable/data.table/issues/1120#issuecomment-463584656
  mtd0 = a10 <- unique(dt, by = c(cols, "ID"))[, .(count = .N), cols],

  # https://github.com/Rdatatable/data.table/issues/1120#issuecomment-463597107
  mtd1 = a11 <- dt[, .N, c(cols, "ID")][, .(count = .N), cols],

  mtd2 = a2 <- dt[order(A, B, C), {
    uniqn <- rleidv(c(.SD, .(ID)))
    lastidx <- c(which(diff(rowidv(.SD)) < 1L), .N)
    c(.SD[lastidx], .(count = c(uniqn[lastidx[1L]], diff(uniqn[lastidx]))))
  }, .SDcols = cols]
)
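To make the idea behind mtd1 easier to follow, the same one-liner can be written as two steps; this is only a restatement of the code above, assuming dt and cols as already defined:

# step 1: collapse to one row per (A, B, C, ID) combination
step1 <- dt[, .N, by = c(cols, "ID")]

# step 2: the row count per (A, B, C) group is now the number of distinct IDs
step2 <- step1[, .(count = .N), by = cols]

step2 should match a11 from the benchmark (fsetequal(a11, step2) should return TRUE).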

Checks:
> fsetequal(a00, a01)
[1] TRUE

> fsetequal(a00, setDT(a_dplyr))
[1] TRUE

> fsetequal(a00, a10)
[1] TRUE

> fsetequal(a00, a11)
[1] TRUE

> fsetequal(a00, a2)
[1] TRUE

Timings for the particular dataset below:
Unit: milliseconds
        expr         min          lq        mean      median          uq         max neval
  DATA_TABLE 147478.1089 147478.1089 147478.1089 147478.1089 147478.1089 147478.1089     1
 DATA_TABLE1   4998.8236   4998.8236   4998.8236   4998.8236   4998.8236   4998.8236     1
       DPLYR 244081.6925 244081.6925 244081.6925 244081.6925 244081.6925 244081.6925     1
        mtd0   4519.4046   4519.4046   4519.4046   4519.4046   4519.4046   4519.4046     1
        mtd1   2866.5808   2866.5808   2866.5808   2866.5808   2866.5808   2866.5808     1
        mtd2    809.7442    809.7442    809.7442    809.7442    809.7442    809.7442     1

Data with 1 million rows:
# R-3.6.1 64bit Win10
library(data.table)      # data.table_1.12.8, getDTthreads() == 4
library(dplyr)           # dplyr_1.0.0
library(microbenchmark)

# data simulated
set.seed(0L)
nr <- 1e6
dt = data.table(A = sample(1:424700, nr, replace = T),
                B = sample(1:11900,  nr, replace = T),
                C = sample(1:600,    nr, replace = T),
                D = sample(1:3000,   nr, replace = T))
dt[, ID := paste(A, ":::", D, ":::", C)]
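As rough context for why the plain uniqueN-by-group call is so slow here (the data.table issue linked above for mtd0/mtd1 concerns uniqueN's performance when evaluated per group), it helps to look at how many (A, B, C) groups this simulation produces; a minimal check, assuming the 1-million-row dt above:

# number of distinct (A, B, C) groups the grouped calls have to iterate over
n_groups <- uniqueN(dt, by = c("A", "B", "C"))
n_groups
nrow(dt) / n_groups   # average rows per group; close to 1 for this simulation

With this many tiny groups, whatever per-group overhead a function carries is paid hundreds of thousands of times.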

Related question on Stack Overflow: https://stackoverflow.com/questions/60623235/
