
r - Variable length df subsampling function r

Reposted. Author: 行者123. Updated: 2023-12-04 10:21:20

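I need to write a function that subsets a df into a variable number of bins, n. If n is 2, the df is subsampled a number of times in two bins (one draw from the first half, one from the second half). If n is 3, it is subsampled in 3 bins (first 1/3, second 1/3, third 1/3). So far I've been doing this manually for different values of n, and I know there has to be a better way. I want to write it as a function with n as an input, but so far I can't make it work. Code below.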

library(tidyverse)  # for %>%, mutate(), filter() and ggplot() used below

# create df
df <- data.frame(year = 1:46,
                 sample = seq(from = 10, to = 30, length.out = 46) + rnorm(46, mean = 0, sd = 2))
# real df has some NAs, so we'll add some here
df[c(20,32),2] <- NA

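This df is 46 years of sampling. I want to pretend that instead of 46 samples I only took 2: one in a random year from the first half (1:23) and one in a random year from the second half (24:46).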
# to subset in 2 groups, say, 200 times
# I'll make a df of elements to sample
half <- nrow(df) %/% 2  # last row of the first half (23 here)
samplelist <- data.frame(firstsample = sample(1:half, 200, replace = TRUE),               # first sample from rows 1:23
                         secondsample = sample((half + 1):nrow(df), 200, replace = TRUE)) # second sample from rows 24:46
samplelist <- as.matrix(samplelist)


# start a df to add to
plot_df <- df %>% mutate(first='all',
second = 'all',
group='full')

# fill the df with the sampled rows, one group per draw
for (i in seq_len(nrow(samplelist))) {
  plot_df <- rbind(plot_df,
                   df[samplelist[i, ], ] %>%
                     mutate(first = samplelist[i, 1],
                            second = samplelist[i, 2],
                            group = i))
  print(i)
}

(It would be even better if it could skip the years whose sample is NA.)

So, if I wanted to do this with thirds instead of halves, I'd repeat the process like this:
# to subset in 3 groups 200 times
# I'll make a df of elements to sample
b1 <- round(nrow(df) / 3)      # last row of the first third (15 here)
b2 <- round(nrow(df) * 2 / 3)  # last row of the second third (31 here)
samplelist <- data.frame(firstsample = sample(1:b1, 200, replace = TRUE),               # first sample in first 1/3
                         secondsample = sample((b1 + 1):b2, 200, replace = TRUE),       # second sample in second 1/3
                         thirdsample = sample((b2 + 1):nrow(df), 200, replace = TRUE))  # third sample in last 1/3
samplelist <- as.matrix(samplelist)

# start a df to add to
plot_df <- df %>% mutate(first='all',
second = 'all',
third = 'all',
group='full')

# fill the df with the sampled rows, one group per draw
for (i in seq_len(nrow(samplelist))) {
  plot_df <- rbind(plot_df,
                   df[samplelist[i, ], ] %>%
                     mutate(first = samplelist[i, 1],
                            second = samplelist[i, 2],
                            third = samplelist[i, 3],
                            group = i))
  print(i)
}

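However, I want to do this many times, with up to 20 samples (so in 20 bins), so this manual approach isn't sustainable. Can you help me write a function that says "pick one sample from each of n bins, x times"?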
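One possible reading of this request, sketched in base R: assign the row indices to n contiguous bins of roughly equal size, then draw one index per bin, x times. The helper name `sample_bin_indices` and the `ceiling()`-based bin rule are assumptions made for illustration, not part of the original post. The result has the same shape as the `samplelist` matrix built by hand above (one row per draw, one column per bin).

```r
# Hypothetical helper: draw one row index from each of n contiguous bins, x times.
sample_bin_indices <- function(n_rows, n, x) {
  stopifnot(n >= 1, n <= n_rows, x >= 1)
  # Bin label for every row index: rows 1..n_rows are cut into n contiguous bins
  bin <- ceiling(seq_len(n_rows) * n / n_rows)
  idx_by_bin <- split(seq_len(n_rows), bin)
  # For each bin, draw x indices with replacement
  draws <- vapply(
    idx_by_bin,
    function(idx) idx[sample.int(length(idx), x, replace = TRUE)],
    integer(x)
  )
  # Rows = draws, columns = bins (also when x == 1)
  matrix(draws, nrow = x,
         dimnames = list(NULL, paste0("sample_", seq_len(n))))
}

set.seed(1)
samplelist <- sample_bin_indices(46, n = 3, x = 200)
dim(samplelist)
```

With 46 rows, n = 2 gives the bins 1:23 and 24:46 described above, and n = 3 gives 1:15, 16:30 and 31:46. Using `sample.int()` on the bin length avoids the surprise that `sample()` treats a single number k as `1:k`.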

By the way, this is the plot I make with the complete df:
plot_df %>%
  ggplot(aes(x = year, y = sample)) +
  geom_point(color = "grey40") +
  stat_smooth(geom = "line",
              method = "lm",
              alpha = .3,
              aes(color = group,
                  group = group),
              se = F,
              show.legend = F) +
  geom_line(color = "grey40") +
  geom_smooth(data = plot_df %>% filter(group %in% c("full")),
              method = "lm",
              alpha = .7,
              color = "black",
              size = 2,
              #se=F,
              # fill="grey40
              show.legend = F) +
  theme_classic()

Best answer

If I got you right, the following function splits your df into n bins, draws x samples from each bin, and puts the results back into the columns of a df:

library(tidyverse)

set.seed(42)

df <- data.frame(year = c(1:46),
sample = seq(from=10,to=30,length.out = 46) + rnorm(46,mean=0,sd=2) )

get_df_sample <- function(df, n, x) {
  df %>%
    # bin df in n bins of (approx.) equal length
    mutate(bin = ggplot2::cut_number(seq_len(nrow(.)), n, labels = seq_len(n))) %>%
    # split by bin
    split(.$bin) %>%
    # sample x times from each bin
    map(~ .x[sample(seq_len(nrow(.x)), x, replace = TRUE),]) %>%
    # keep only column "sample"
    map(~ select(.x, sample)) %>%
    # Rename: Add number of df-bin from which sample is drawn
    imap(~ rename(.x, !!sym(paste0("sample_", .y)) := sample)) %>%
    # bind
    bind_cols() %>%
    # Add group = rownames
    rownames_to_column(var = "group")
}
get_df_sample(df, 3, 200) %>%
head()
#> sample_1 sample_2 sample_3 group
#> 1 12.58631 18.27561 24.74263 1
#> 2 19.46218 24.24423 23.44881 2
#> 3 12.92179 18.47367 27.40558 3
#> 4 15.22020 18.47367 26.29243 4
#> 5 12.58631 24.24423 24.43108 5
#> 6 19.46218 23.36464 27.40558 6
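The function above returns the sampled values in wide format and draws from every row. The question also wanted NA years skipped and plots the data by `year` and `group`, so a variant in long format may be useful. The following is only a sketch under those assumptions, written in base R; the name `sample_bins_long` and the `ceiling()`-based bin rule are hypothetical and not part of the original answer.

```r
# Example data as in the question, including the two NA samples
set.seed(42)
df <- data.frame(year = 1:46,
                 sample = seq(from = 10, to = 30, length.out = 46) +
                   rnorm(46, mean = 0, sd = 2))
df[c(20, 32), 2] <- NA

# Hypothetical helper: one draw per bin, repeated x times, skipping rows whose
# `sample` is NA; returns one row per drawn observation (long format).
sample_bins_long <- function(df, n, x) {
  # Assign every row to one of n contiguous, roughly equal-sized bins
  bin <- ceiling(seq_len(nrow(df)) * n / nrow(df))
  ok <- !is.na(df$sample)
  # Candidate row indices per bin, excluding NA samples
  candidates <- split(which(ok), bin[ok])
  # Every bin needs at least one non-NA row, otherwise it cannot be sampled
  stopifnot(length(candidates) == n)
  draws <- lapply(seq_len(x), function(g) {
    # Pick one candidate row from each bin for replicate g
    rows <- vapply(candidates,
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
    data.frame(year = df$year[rows],
               sample = df$sample[rows],
               bin = seq_len(n),
               group = as.character(g))
  })
  do.call(rbind, draws)
}

long <- sample_bins_long(df, n = 3, x = 200)
head(long)
```

Each `group` holds n rows (one per bin), which is the shape the question's `stat_smooth(aes(group = group))` layer expects. To overlay the full series, the original rows could be appended with `group = "full"`, as the question's `plot_df` does.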

Created on 2020-03-24 by the reprex package (v0.3.0)

Regarding "r - Variable length df subsampling function r", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60829944/
