R: Removing nested for loops to make a custom bootstrap more efficient

Reposted. Author: 行者123. Updated: 2023-12-04 15:30:00

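I am trying to gather bootstrapped estimates of some summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to rely on nested for loops.

Specifically, suppose there are two groups in my dataset, and each group is further split into test and control. Group 1 has a 75%/25% test-to-control ratio, while group 2 has a 50%/50% test-to-control ratio.

I want to resample so that the dataset stays the same size, but the test-to-control ratio is 90%/10% for both groups. In other words, I want to resample different subgroups at different rates, which strikes me as different from what the boot package normally does.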
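The size arithmetic behind that target can be sketched as follows. This is only an illustration: the group sizes and the names `group_sizes` and `target_test_fraction` are made up, not taken from the real dataset.

```r
# Hypothetical group sizes, for illustration only
group_sizes <- c("1" = 1000, "2" = 400)
target_test_fraction <- 0.90

# Test resample size is ~90% of the group; control takes the remainder,
# so the two always add up to the original group size
tsize <- round(target_test_fraction * group_sizes)
csize <- group_sizes - tsize

sizes <- data.frame(group = names(group_sizes), tsize = tsize, csize = csize)
print(sizes)
```

Taking the control size as the remainder, rather than rounding 10% separately, keeps each group's resample exactly as large as the original group.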

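In my dataset, I created a variable group representing the group, and a variable groupT representing the group concatenated with test/control, for example: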

    id     group     groupT
    1      1         1T
    2      1         1T
    3      2         2T
    4      1         1C
    5      2         2C

Here is what I am running right now, with nreps arbitrarily set to my number of bootstrap replications:
for (j in 1:nreps) {

  bootdat <- datafile[-(1:nrow(datafile)), ]  ## initialize empty dataset

  for (i in unique(datafile$group)) {

    tstring <- paste0(i, "T")  ## e.g. 1T
    cstring <- paste0(i, "C")  ## e.g. 1C

    ## Size of test group resample should be ~90% of total group size
    tsize <- round(.90 * length(which(datafile$group == i)), 0)

    ## Size of control group resample should be total group size minus test group size
    csize <- length(which(datafile$group == i)) - tsize

    ## Continue building bootdat by rbinding the test and control resample
    ## before moving on to the next group
    ## Note the use of datafile$groupT == tstring to ensure I'm only sampling from test, etc.
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT == tstring), size = tsize,
                                              replace = TRUE), ])

    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT == cstring), size = csize,
                                              replace = TRUE), ])
  }

  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}

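For a dataset of about 1 million records, this takes 3-4 minutes per replication. I am sure there is a better way to do this, with sapply or perhaps some dplyr functions, but my attempts so far have come up empty. Any help would be appreciated!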

Best answer

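I strongly recommend looking at data.table and foreach, using keyed searches for the bootstrap. This lets you run a single bootstrap very quickly, and you can run each bootstrap independently on a different core. Each bootstrap below takes 0.5 seconds on my machine, searching a table of 1 million rows. Something like the following should get you started: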
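As a minimal sketch of what a keyed search looks like in isolation, the toy table below (hypothetical data, with the invented names `toy`, `ids` and `resampled`) is keyed by id and then queried with a vector of sampled ids:

```r
library(data.table)

# Toy table (hypothetical data, for illustration only)
toy <- data.table(id = 1:10, value = letters[1:10])
setkey(toy, id)  # sort by id and mark it as the key

# Sample ids with replacement, then fetch the matching rows with a keyed join;
# a repeated id returns a repeated row, which is what a bootstrap needs
set.seed(1)
ids <- sample(toy$id, size = 5, replace = TRUE)
resampled <- toy[.(ids)]
print(resampled)
```

The join returns one row per requested id, in the order requested, so the resample has exactly the size passed to sample().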

library(data.table)
library(foreach)
library(doMC)  # parallel backend; not available on Windows
registerDoMC(cores = 4)

# example data (every column must have the same length)
dat <- data.table(id = 1:1e6,
                  group = sample(2, size = 1e6, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 1e6, replace = TRUE))

# define number of bootstraps
nBootstraps <- 1000

# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test

# get number that you want to sample from each group
# (these sizes are based on the full table and are applied to each group below)
N.test <- round(fraction_test * nrow(dat))
N.control <- round(fraction_control * nrow(dat))

# key data by id
setkey(dat, id)

# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group == 1 & test_control == "T"]$id
group1_control_ids <- dat[group == 1 & test_control == "C"]$id
group2_test_ids <- dat[group == 2 & test_control == "T"]$id
group2_control_ids <- dat[group == 2 & test_control == "C"]$id

results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %dopar% {

  # sample each group with the defined sizes, with replacement
  g1T <- dat[.(sample(group1_test_ids, size = N.test, replace = TRUE))]
  g1C <- dat[.(sample(group1_control_ids, size = N.control, replace = TRUE))]
  g2T <- dat[.(sample(group2_test_ids, size = N.test, replace = TRUE))]
  g2C <- dat[.(sample(group2_control_ids, size = N.control, replace = TRUE))]
  dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
  dat.all[, bootstrap := n]

  # do summary stats here with dat.all and return the summary stats data.table object;
  # the row count per cell below is only a placeholder so that the loop runs
  dat.summarized <- dat.all[, .(n_rows = .N), by = .(bootstrap, group, test_control)]
  dat.summarized
}
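In the example above, N.test and N.control are computed from the full table and then drawn for each group, so one bootstrap is larger than the original data. If each group should instead keep its own size with a 90%/10% split, as the question asks, one possible sketch is below. It uses a small hypothetical table, and the names `boot_ids`, `n_test`, `n_control`, `t_ids` and `c_ids` are illustrative. It also assumes every group has at least one test and one control row.

```r
library(data.table)

set.seed(42)
# Small hypothetical data set, for illustration only
dat <- data.table(id = 1:1000,
                  group = sample(2, size = 1000, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 1000, replace = TRUE))
setkey(dat, id)

fraction_test <- 0.90

# One bootstrap replicate: within each group, draw ~90% of the group's size
# from its test ids and the remainder from its control ids
boot_ids <- dat[, {
  n_test <- round(fraction_test * .N)
  n_control <- .N - n_test
  t_ids <- id[test_control == "T"]
  c_ids <- id[test_control == "C"]
  .(id = c(t_ids[sample.int(length(t_ids), n_test, replace = TRUE)],
           c_ids[sample.int(length(c_ids), n_control, replace = TRUE)]))
}, by = group]

# Keyed search pulls the full rows for the sampled ids
boot <- dat[.(boot_ids$id)]
print(boot[, .(n = .N, share_test = mean(test_control == "T")), keyby = group])
```

Indexing with sample.int() rather than calling sample() on the id vector avoids the base R behaviour where sample(x) with a single number x samples from 1:x.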

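EDIT: The example below builds a lookup table for each of an arbitrary number of unique groups. The IDs corresponding to each combination of group + (test or control) can then be referenced inside a foreach loop for simplicity. With lower values for N.test and N.control (900 and 100), it produces the results for 1000 bootstraps: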
library(data.table)
library(foreach)

# example data (every column must have the same length)
dat <- data.table(id = 1:1e6,
                  group = sample(24, size = 1e6, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 1e6, replace = TRUE))

# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]

# define number of bootstraps (1000, matching the output and timing shown below)
nBootstraps <- 1000

# get number that you want to sample from each group
N.test <- 900
N.control <- 100

# key data by id
setkey(dat, id)

# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()

for (i in all_groups) {
  control.ids[[i]] <- dat[group == i & test_control == "C"]$id
  test.ids[[i]] <- dat[group == i & test_control == "T"]$id
}

results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %do% {
  foreach(group.i = all_groups, .combine = "rbind") %do% {

    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]

    # search and bind
    controls <- dat[.(sample(control_id_vector, size = N.control, replace = TRUE))]
    tests <- dat[.(sample(test_id_vector, size = N.test, replace = TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    dat.group[]
  }
  # summarize across all groups for this bootstrap and return summary stat data.table object

}

This yields:
> results
              id group test_control bootstrap
       1: 701570     1            C         1
       2: 424018     1            C         1
       3: 909932     1            C         1
       4:  15354     1            C         1
       5: 514882     1            C         1
      ---
23999996: 898651    24            T      1000
23999997: 482374    24            T      1000
23999998: 845577    24            T      1000
23999999: 862359    24            T      1000
24000000: 602078    24            T      1000

This does not include any time for computing summary statistics, but here the 1000 bootstraps were pulled serially on 1 core in:
   user  system elapsed
 62.574   1.267  63.844
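The control.ids and test.ids lookup tables above are filled by filtering the table once per group. As an alternative sketch in base R (not part of the original answer), the same id vectors could be built in a single pass with split(), keyed by the "1T" / "1C" labels used in the question. The data and the name `ids_by_cell` are hypothetical.

```r
set.seed(3)
# Hypothetical data, for illustration only
dat <- data.frame(id = 1:100,
                  group = sample(3, size = 100, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 100, replace = TRUE))

# One pass over the data: a named list of id vectors keyed by "1T", "1C", ...
ids_by_cell <- split(dat$id, paste0(dat$group, dat$test_control))

# Look up one cell by name, as the environments are used inside the loop
test_ids_group1 <- ids_by_cell[["1T"]]
print(names(ids_by_cell))
```

A named list is looked up the same way as an environment here (`[[name]]`), so the sampling loop would not need to change.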

If you need to hand-code N so that it differs for each group, you can do the same thing as for the id lookup:
# create environments
control.Ns <- new.env()
test.Ns <- new.env()

# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
# ... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5

Then change the big bootstrap loop to look up these values based on the loop's current group:
results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %do% {
  foreach(group.i = all_groups, .combine = "rbind") %do% {

    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]

    # get size values
    N.control <- control.Ns[[group.i]]
    N.test <- test.Ns[[group.i]]

    # search and bind
    controls <- dat[.(sample(control_id_vector, size = N.control, replace = TRUE))]
    tests <- dat[.(sample(test_id_vector, size = N.test, replace = TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    dat.group[]
  }
  # summarize across all groups for this bootstrap and return summary stat data.table object

}
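The summarizing step is left open in the loops above. As one hedged illustration of what it might look like, the sketch below computes one statistic per bootstrap and then a percentile interval across bootstraps. The outcome column `y`, the statistic (difference in means) and the stand-in `results` table are all assumptions made for the example; the original data has no such column.

```r
library(data.table)

set.seed(7)
# Hypothetical stand-in for `results`: 200 replicates of 100 rows each,
# with an invented outcome column `y`
results <- data.table(bootstrap = rep(1:200, each = 100),
                      test_control = sample(c("T", "C"), size = 200 * 100,
                                            replace = TRUE, prob = c(0.9, 0.1)),
                      y = rnorm(200 * 100))

# One statistic per replicate: mean outcome in test minus mean outcome in control
stat_by_boot <- results[, .(diff = mean(y[test_control == "T"]) -
                                   mean(y[test_control == "C"])),
                        by = bootstrap]

# Percentile interval across replicates
ci <- quantile(stat_by_boot$diff, probs = c(0.025, 0.975), na.rm = TRUE)
print(ci)
```

Computing the statistic inside the foreach body and returning only that one-row summary, instead of the full resampled rows, would keep the combined result small.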

Regarding "R: Removing nested for loops to make a custom bootstrap more efficient", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49277633/
