
r - Increase query speed with a complex survey design in R


I have a large dataset (more than 20 million observations) that I analyze with the survey package, and running even simple queries takes a very long time. I have tried to find ways to speed up my code, but I would like to know whether there are better ways to make this more efficient.
In my benchmark, I compare the speed of three versions of the same svyby/svytotal command:

  • a plain svyby/svytotal call
  • parallel computing with foreach %dopar% using 7 cores
  • a compiled version of option 2

  • Spoiler: option 3 is more than twice as fast as the first option, but it is not suitable for large datasets, because it relies on parallel computing, which quickly hits memory limits when handling large datasets. I run into this problem even with 16GB of RAM. There are several solutions to this memory limitation, but none of them applies to survey design objects.

    Any ideas on how to make this faster without crashing because of memory limits? (One batching idea is sketched after the benchmark results below.)

    My code, with a reproducible example:
    # Load Packages
    library(survey)
    library(data.table)
    library(compiler)
    library(foreach)
    library(doParallel)
    options(digits=3)

    # Load Data
    data(api)

    # Convert data to data.table format (mostly to increase speed of the process)
    apiclus1 <- as.data.table(apiclus1)

    # Replicate the data observations 1000 times
    apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]

    # create a count variable
    apiclus1[, Vcount := 1]

    # create survey design
    dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
    1) Simple code
    t1 <- Sys.time()
    table1 <- svyby(~Vcount,
                    ~stype+dnum+cname,
                    design = dclus1,
                    svytotal)
    T1 <- Sys.time() - t1
    2) Parallel computing with foreach %dopar% using 7 cores
    # in this option, I create a list with different subsets of the survey design
    # that will be passed to different CPU cores to work at the same time

    subdesign <- function(i){ subset(dclus1, dnum==i)}
    groups <- unique(apiclus1$dnum)
    list_subsets <- lapply(groups, subdesign) # apply the function to get all subsets in a list
    i <- NULL

    # Start Parallel
    registerDoParallel(cores=7)

    t2 <- Sys.time()
    table2 <- foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
      options(survey.lonely.psu = "remove")
      svyby(~Vcount,
            ~stype+dnum+cname,
            design = i,
            svytotal)
    }
    T2 <- Sys.time() - t2
    3) Compiled version of option 2
    # make a function of the previous query
    query2 <- function(list_subsets) {
      foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
        svyby(~Vcount,
              ~stype+dnum+cname,
              design = i,
              svytotal)
      }
    }

    # Compile the function to increase speed
    query3 <- cmpfun(query2)

    t3 <- Sys.time()
    table3 <- query3(list_subsets)
    T3 <- Sys.time() - t3
    Results
    > T1: 1.9 secs
    > T2: 1.13 secs
    > T3: 0.58 secs

    barplot(c(T1, T2, T3),
            names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
            ylab = "Seconds")
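
    As for the batching idea mentioned above: one way to cap memory use would be to feed the subsets to the workers in sequential batches rather than all at once, so only a handful of survey-design copies are in flight at any time. A minimal sketch, reusing list_subsets and the doParallel backend from above (chunk_size is an arbitrary illustrative value, not a tested setting):

    # process the subsets in sequential chunks of `chunk_size`
    chunk_size <- 7
    chunks <- split(list_subsets, ceiling(seq_along(list_subsets) / chunk_size))

    # run each chunk through the parallel backend, then stack the results
    table_batched <- do.call(rbind, lapply(chunks, function(chunk) {
      foreach(i = chunk, .combine = rbind, .packages = "survey") %dopar% {
        svyby(~Vcount,
              ~stype+dnum+cname,
              design = i,
              svytotal)
      }
    }))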

    Best Answer

    Thanks for the well-crafted question. Working efficiently with large survey datasets in R may require some basic SQL syntax (which is easier to learn than R). MonetDB is the only big-data option compatible with the survey package; exploring other high-performance packages is (probably) not going to be fruitful. Generally, when I am exploring a huge dataset, I write SQL queries directly rather than using the survey package, because the standard-error computations are computationally intensive (and variances are not that useful during interactive data exploration). Notice how the final SQL timing below blows all the other options out of the water. To calculate a quick weighted mean, use something like "SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column"
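
    For example, such a weighted mean can be pulled with a single dbGetQuery call against the MonetDBLite connection built in the code below. A sketch reusing that db connection and the apiclus1 table; api00 and stype are just illustrative column choices:

    system.time( wmean <- dbGetQuery( db ,
        "SELECT stype , SUM( api00 * pw ) / SUM( pw ) AS wmean_api00 FROM apiclus1 GROUP BY stype" ) )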

    When you do need standard errors interactively, linearization (svydesign) is generally more computationally demanding than replication (svrepdesign), but sometimes creating the replication design (as I have done with jk1w_dclus1 below) requires more familiarity with survey methodology than some users are comfortable with. (For in-memory designs, there is also a one-line as.svrepdesign shortcut, noted as an aside in the code below.)

    # Load Packages
    library(MonetDB.R)
    library(MonetDBLite)
    library(DBI) # suggested in comments and needed on OSX
    library(survey)

    # Load Data
    data(api)

    # Replicate the data observations 10000 times
    apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]

    # create a count variable
    apiclus1$vcount <- 1

    # create survey design
    dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
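
    # aside (a sketch): for in-memory designs like dclus1, `as.svrepdesign`
    # can build the replicate weights from the linearized design in one step;
    # for the database-backed design, they are constructed by hand below
    jk1_auto <- as.svrepdesign( dclus1 , type = "JK1" )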


    dbfolder <- tempdir()

    db <- dbConnect( MonetDBLite() , dbfolder )
    dbWriteTable( db , 'apiclus1' , apiclus1 )


    db_dclus1 <-
      svydesign(
        weight = ~pw ,
        id = ~dnum ,
        data = "apiclus1" ,
        dbtype = "MonetDBLite" ,
        dbname = dbfolder ,
        fpc = ~fpc
      )

    # you provided a design without strata,
    # so type="JK1" matches that most closely.
    # but see survey:::as.svrepdesign for other linearization-to-replication options
    jk1w <- jk1weights( psu = apiclus1$dnum , fpc = apiclus1$fpc )

    # after the replicate-weights have been constructed,
    # here's the `svrepdesign` call..
    jk1w_dclus1 <-
      svrepdesign(
        weight = ~pw ,
        type = "JK1" ,
        repweights = jk1w$repweights ,
        combined.weights = FALSE ,
        scale = jk1w$scale ,
        rscales = jk1w$rscales ,
        data = 'apiclus1' ,
        dbtype = "MonetDBLite" ,
        dbname = dbfolder
      )

    # slow
    system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
    # > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
    # user system elapsed
    # 17.40 2.86 20.27


    # faster
    system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
    # > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
    # user system elapsed
    # 13.00 1.20 14.18


    # fastest
    system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
    # > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
    # user system elapsed
    # 10.75 1.19 11.96

    # same standard errors across the board
    all.equal( SE( res1 ) , SE( res2 ) )
    all.equal( SE( res2 ) , SE( res3 ) )
    # NOTE: the replicate-weighted design will be slightly different
    # for certain designs. however this technique is defensible
    # and gets used in
    # https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico


    # at the point you do not care about standard errors,
    # learn some sql:
    system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
    # > system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
    # user system elapsed
    # 0.15 0.20 0.23
    # because this is near-instantaneous, no matter how much data you have.

    # same numbers as res1:
    all.equal( as.numeric( sort( coef( res1 ) ) ) , sort( res4$L1 ) )

    The original thread for "r - Increase query speed with a complex survey design in R" is on Stack Overflow: https://stackoverflow.com/questions/32380663/
