r - Efficient random sampling in R

Reposted by: 行者123 | Updated: 2023-12-01 00:12:03

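From a data frame, I am trying to draw random samples of 1 to 20 observations, and for each sample size I want to replicate the process 4 times. I came up with this working solution, but it is slow: because of crossing(), it has to handle a large data frame many times. Can anyone point me to a more efficient solution?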

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 
  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

    data %>%
      sample_n(n, replace = TRUE) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      pull(mean_mpg)

  }))
#> # A tibble: 240 x 5
#>      cyl data              n_random_sample n_replicate   res
#>    <dbl> <list>                      <int>       <int> <dbl>
#>  1     6 <tibble [7 × 10]>               1           1  17.8
#>  2     6 <tibble [7 × 10]>               1           2  21  
#>  3     6 <tibble [7 × 10]>               1           3  19.2
#>  4     6 <tibble [7 × 10]>               1           4  18.1
#>  5     6 <tibble [7 × 10]>               2           1  19.6
#>  6     6 <tibble [7 × 10]>               2           2  19.4
#>  7     6 <tibble [7 × 10]>               2           3  19.6
#>  8     6 <tibble [7 × 10]>               2           4  20.4
#>  9     6 <tibble [7 × 10]>               3           1  20.1
#> 10     6 <tibble [7 × 10]>               3           2  18.9
#> # ... with 230 more rows

Created on 2018-11-19 by the reprex package (v0.2.1)

Edit: I am now working with a much larger dataset. Is it possible to do this more efficiently with data.table?

Best answer

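Here is an alternative solution. Instead of using nest to create sub-datasets stored in a list column and then using map to draw a sample from each one, it subsets the original dataset and uses a function to select a sample of rows: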

library(tidyverse)

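# create function to sample rows: returns the mean mpg of n rows
# drawn with replacement from the cars that have cyl_value cylinders
f = function(cyl_value, n) {
  mtcars %>%
    filter(cyl == cyl_value) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
}

# vectorise function so it accepts vectors of cyl values and sample sizes
f = Vectorize(f)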

# set seed for reproducibility
set.seed(11)

as_tibble(mtcars) %>%   # tbl_df() is deprecated in current dplyr
  distinct(cyl) %>%
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
  mutate(res = f(cyl, n_random_sample))

# # A tibble: 240 x 4
#     cyl n_random_sample n_replicate   res
#   <dbl>           <int>       <int> <dbl>
# 1     6               1           1  21  
# 2     6               1           2  21  
# 3     6               1           3  18.1
# 4     6               1           4  21  
# 5     6               2           1  20.4
# 6     6               2           2  21.2
# 7     6               2           3  20.4
# 8     6               2           4  19.6
# 9     6               3           1  18.4
#10     6               3           2  19.6
# # ... with 230 more rows
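
The answer above still filters the full data frame once per row of the grid. As a minimal sketch of the same idea taken one step further (not part of the original answer), the mpg values can be split by cyl once and then sampled as plain vectors in base R. The names mpg_by_cyl, grid and mean_of_sample are made up for this illustration, and whether this is faster on a given dataset is something to benchmark rather than assume.

```r
# Split the values to sample by group once, up front
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)   # list named "4", "6", "8"

# One row per (group, sample size, replicate) combination
grid <- expand.grid(
  n_replicate     = 1:4,
  n_random_sample = 1:20,
  cyl             = names(mpg_by_cyl),
  stringsAsFactors = FALSE
)

# Mean of n values drawn with replacement from one group.
# Sampling indices avoids the sample(x, n) pitfall when x has length 1.
mean_of_sample <- function(cyl, n) {
  x <- mpg_by_cyl[[cyl]]
  mean(x[sample.int(length(x), n, replace = TRUE)])
}

set.seed(11)
grid$res <- mapply(mean_of_sample, grid$cyl, grid$n_random_sample,
                   USE.NAMES = FALSE)

head(grid)
```

Because only numeric vectors are sampled, no intermediate data frames are built inside the loop.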

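The edit to the question asks about data.table, which the answer does not cover. One possible approach, offered as a sketch under assumptions rather than a tested recommendation, is to do the sampling inside a single grouped call so that each group's mpg column is extracted only once. The helper sample_means and the vectors sizes and reps are hypothetical names introduced here.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Sample sizes 1..20, each repeated for 4 replicates (80 draws per group)
sizes <- rep(1:20, each = 4)
reps  <- rep(1:4, times = 20)

# For each requested size, draw that many values with replacement
# from x and return the mean of the draw
sample_means <- function(x, sizes) {
  vapply(
    sizes,
    function(n) mean(x[sample.int(length(x), n, replace = TRUE)]),
    numeric(1)
  )
}

set.seed(11)
res <- dt[, .(
  n_random_sample = sizes,
  n_replicate     = reps,
  res             = sample_means(mpg, sizes)
), by = cyl]

head(res)
```

The result has one row per combination of cyl, sample size and replicate, matching the shape of the tibble in the answer. Timings on a large dataset would need to be measured with the real data.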
For r - Efficient random sampling in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53376574/
