gpt4 book ai didi

r - 更快的上次观察结转 (LOCF)

转载 作者:行者123 更新时间:2023-12-02 01:14:27 24 4
gpt4 key购买 nike

我最近需要按 id 跨时间向前和向后分布 12 个时不变变量的值。我的数据集包含 2,448,638 个观测值和 57 个变量。

这是一个可重现的讨论示例:

# Load packages
library(tidyverse)
library(zoo)
library(lubridate)
library(tidyr)

# Reproducable example
set.seed(2017)
df <- tibble(
id = integer(15),
days = integer(15),
race = character(15),
language = character(15)
) %>%

mutate(
id = rep(1:3, each = 5)
) %>%

group_by(id) %>%

mutate(
days = as.integer(c(rnorm(2, -30, 15), 0, rnorm(2, 200, 100))),
race = if_else(days == 0, sample(c("W", "AA", "A", "O"), 1, replace = TRUE), NA_character_),
language = if_else(days == 0, sample(c("English", "Spanish", "Other"), 1, replace = TRUE), NA_character_)
) %>%

arrange(id, days)

df

id days race language
<int> <int> <chr> <chr>
1 1 -31 <NA> <NA>
2 1 -8 <NA> <NA>
3 1 0 W English
4 1 24 <NA> <NA>
5 1 273 <NA> <NA>
6 2 -31 <NA> <NA>
7 2 -23 <NA> <NA>
8 2 0 O English
9 2 4 <NA> <NA>
10 2 199 <NA> <NA>
11 3 -33 <NA> <NA>
12 3 -6 <NA> <NA>
13 3 0 A English
14 3 234 <NA> <NA>
15 3 357 <NA> <NA>

我想出了几种方法来获得我想要的结果:

使用动物园::na.locf

time_invariant <- c("race", "language")

df2 <- df %>%
group_by(id) %>%
mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, desc(days)) %>%
mutate_at(.vars = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, days)

在使用我的 2016 MB Pro 的可重现示例中,需要 0.066293 秒才能完成。

我也试过 tidyr::fill

df2 <- df %>% 
group_by(id) %>%
fill_(fill_cols = time_invariant) %>%
fill_(fill_cols = time_invariant, .direction = "up")

在使用我的 2016 MB Pro 的可重现示例中需要 0.04381​​585 秒才能完成。

然而,根据我的真实数据,zoo::na.locf 方法花费了 3.172092 分钟,而 tidyr::fill 方法花费了 5.523152 分钟。这些时间并不可怕,但我确实注意到它们比 Stata 慢得多(在我运行 Stata 14.2 的 2016 MB Pro 上为 9.9060 秒)。这种速度差异促使我想看看是否有人知道更快的方法。

最佳答案

很确定这可以通过专家更快地完成:

df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L), days = c(-31L, -8L, 0L, 24L, 273L, -31L,
-23L, 0L, 4L, 199L, -33L, -6L, 0L, 234L, 357L), race = c(NA,
NA, "W", NA, NA, NA, NA, "O", NA, NA, NA, NA, "A", NA, NA), language = c(NA,
NA, "English", NA, NA, NA, NA, "English", NA, NA, NA, NA, "English",
NA, NA)), class = "data.frame", row.names = c(NA, -15L), .Names = c("id",
"days", "race", "language"))

library(dplyr)
library(zoo)
library(tidyr)
time_invariant <- c("race", "language")
dplyrzoo <- function() {
df2 <- df %>%
group_by(id) %>%
mutate_at(.cols = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, desc(days)) %>%
mutate_at(.cols = time_invariant, .funs = na.locf, na.rm = FALSE) %>%
arrange(id, days)
}

dplyrfill <- function() {
df2 <- df %>%
group_by(id) %>%
fill_(fill_cols = time_invariant) %>%
fill_(fill_cols = time_invariant, .direction = "up")
}

library(data.table)
dtstyle <- function() {
dt <- data.table(df)
cols <- c("race", "language")
dt[, (cols) := lapply(.SD, function(x) na.omit(x)[1]), .SDcols=cols, by =.(id)]
dt
}

#check results
all.equal(as.data.frame(dplyrzoo()), as.data.frame(dplyrfill()))
all.equal(as.data.frame(dtstyle()), as.data.frame(dplyrfill()))

#timings
library(microbenchmark)
timings <- capture.output(microbenchmark(dplyrzoo=dplyrzoo(),
dplyrfill=dplyrfill(),
dtstyle=dtstyle(),
times=100L))
writeLines(paste("#", timings))

# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyrzoo 6.7952 7.01815 7.399851 7.18815 7.53685 10.8360 100
# dplyrfill 4.7458 5.02865 5.319848 5.16990 5.34750 7.8329 100
# dtstyle 1.3598 1.54025 1.692119 1.65420 1.73280 4.0413 100

关于r - 更快的上次观察结转 (LOCF),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43335879/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com