
r - dplyr::slice in data.table

Reposted. Author: 行者123. Updated: 2023-12-04 18:57:01

This question already has answers here:

How to extract the first n rows per group? (3 answers)

Subset rows corresponding to max value by group using data.table (1 answer)

Closed 3 years ago.




What is the idiomatic way to do the following in data.table?

library(dplyr)
df %>%
group_by(b) %>%
slice(1:10)

I can do
library(data.table)
df[, .SD[1:10]
, by = b]

But this seems to be much slower. Is there a better way?
set.seed(0)
df <- rep(1:500, sample(500:1000, 500, replace = TRUE)) %>%
data.table(a = runif(length(.))
,b = .)

f1 <- function(df){
df %>%
group_by(b) %>%
slice(1:10)
}
f2 <- function(df){
df[, .SD[1:10]
, by = b]
}

library(microbenchmark)
microbenchmark(f1(df), f2(df))
#Unit: milliseconds
#   expr      min       lq      mean   median        uq      max neval
# f1(df) 17.67435 19.50381  22.06026 20.50166  21.42668  78.3318   100
# f2(df) 69.69554 79.43387 119.67845 88.25585 106.38661 581.3067   100

========== Benchmark with the suggested methods ==========
set.seed(0)
df <- rep(1:500, sample(500:1000, 500, replace = TRUE)) %>%
data.table(a = runif(length(.))
,b = .)

use.slice <- function(df){
df %>%
group_by(b) %>%
slice(1:10)
}
IndexSD <- function(df){
df[, .SD[1:10]
, by = b]
}
Index.I <- function(df) {
df[df[, .I[seq_len(10)], by = b]$V1]
}
use.head <- function(df){
df[, head(.SD, 10)
, by = b]
}

library(microbenchmark)
microbenchmark(use.slice(df)
, IndexSD(df)
, Index.I(df)
, use.head(df)
, unit = "relative"
, times = 100L)

#Unit: relative
#          expr       min        lq      mean    median        uq       max neval
# use.slice(df)  9.804549 10.269234  9.167413  8.900060  8.782862  6.520270   100
#   IndexSD(df) 38.881793 42.548555 39.044095 38.636523 39.942621 18.981748   100
#   Index.I(df)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   100
#  use.head(df)  3.666898  4.033038  3.728299  3.408249  3.545258  3.951565   100
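
Before comparing timings, it may be worth confirming that the data.table variants return the same rows. The following is a minimal sketch on a small made-up table (the names `grp`, `dt`, `res_sd`, `res_head` and `res_i` are illustrative, not part of the question). It assumes the only difference between the results is column order: the grouped calls put the `by` column first, while the `.I` version keeps the original order.

```r
library(data.table)

set.seed(0)
# Small hypothetical table: 5 groups of 20 rows each
grp <- rep(1:5, each = 20)
dt <- data.table(a = runif(length(grp)), b = grp)

res_sd   <- dt[, .SD[1:10], by = b]               # same approach as IndexSD
res_head <- dt[, head(.SD, 10), by = b]           # same approach as use.head
res_i    <- dt[dt[, .I[seq_len(10)], by = b]$V1]  # same approach as Index.I

# Grouped results put the by column first; align column order before comparing
setcolorder(res_i, names(res_sd))

same_head <- isTRUE(all.equal(res_sd, res_head))
same_i    <- isTRUE(all.equal(res_sd, res_i))
```

If both flags are `TRUE`, the benchmark compares equivalent computations rather than different outputs.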

Best answer

We can use .I to extract the row indices, which should be faster:

out <- df[df[, .I[seq_len(10)], by = b]$V1]
dim(out)
#[1] 5000 2
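
To make the two steps in that one-liner easier to follow, here is a small sketch on a toy table (the data and the names `dt`, `idx` and `first_two` are made up for illustration). The inner call collects, per group, the row numbers of the full table; the outer call is then a single ordinary subset by row number.

```r
library(data.table)

# Toy table (hypothetical data, not the benchmark table from the question)
dt <- data.table(a = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                 b = c(1L, 1L, 1L, 2L, 2L, 2L))

# Inner step: within each group, .I holds row numbers of the whole table;
# the unnamed j result gets the default column name V1
idx <- dt[, .I[seq_len(2)], by = b]
idx$V1
# 1 2 4 5

# Outer step: one plain subset of the original table by those row numbers
first_two <- dt[idx$V1]
```

Because the outer step subsets the original table directly, the columns keep their original order, unlike the grouped `.SD` result, where the `by` column comes first.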

Check whether there are any NAs (as the OP commented):
any(out[, Reduce(`|`, lapply(.SD, is.na))])
#[1] FALSE
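
The NA check matters because `seq_len(10)` indexes past the end of any group that has fewer than 10 rows, which gives NA row numbers. In the benchmark data every group has at least 500 rows, so the problem does not show up there. A minimal sketch of one way to guard against it, assuming a toy table and capping the index at the group size `.N` (the names `dt`, `raw_idx`, `safe_idx` and `out_safe` are illustrative):

```r
library(data.table)

# Toy table (hypothetical): group 2 has only one row
dt <- data.table(a = c(0.1, 0.2, 0.3, 0.4),
                 b = c(1L, 1L, 1L, 2L))

# Uncapped: asking for 2 rows from a 1-row group yields an NA index
raw_idx <- dt[, .I[seq_len(2)], by = b]$V1

# Capped at the group size .N, so no index runs past the end of a group
safe_idx <- dt[, .I[seq_len(min(.N, 2L))], by = b]$V1
out_safe <- dt[safe_idx]
```

`head(.I, 2)` in place of `.I[seq_len(min(.N, 2L))]` should behave the same way, since `head` also stops at the end of a short vector.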


dim(df)
#[1] 374337 2

Benchmarks
f3 <- function(df) {
df[df[, .I[seq_len(10)], by = b]$V1]
}

microbenchmark(f1(df), f2(df), f3(df), unit = "relative", times = 10L)
#Unit: relative
#   expr       min        lq      mean    median        uq      max neval cld
# f1(df)  5.727822  5.480741  4.945486  5.672206  4.317531  5.10003    10   b
# f2(df) 24.572633 23.774534 17.842622 23.070634 16.099822 11.58287    10   c
# f3(df)  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000    10   a

Regarding r - dplyr::slice in data.table, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50093919/
