
r - Fast group-by simple linear regression


This Q & A comes from How to make group_by and lm fast?, where the OP tried to run a simple linear regression per group of a large data frame.

In theory, a series of group-by regressions y ~ x | g is equivalent to a single pooled regression y ~ x * g. The latter is very appealing, since statistical tests between different groups are straightforward. But in practice, fitting this larger regression is not computationally easy. My answer to the linked Q & A reviews the packages speedlm and glm4, but points out that they do not solve this problem well.

Big regression problems are hard, especially when factor variables are present. This perhaps explains why many people abandon the idea and prefer to split the data by group and fit a model per group. There is no need for me to enumerate methods for group-by regression (see Linear Regression and group by in R); what I care about is speed.

For a simple linear regression y ~ x | g, splitting the data by group and then relying on a standard model-fitting routine like lm is a performance killer. First, subsetting a big data frame is inefficient. Second, a standard model-fitting routine goes through the following steps, which are pure overhead for the useful regression computation (a small sketch after the list illustrates this overhead):

  • parse the model formula into a "terms" object (using terms.formula);
  • build a model frame (using model.frame.default);
  • build a model matrix (using model.matrix.default).
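A minimal sketch (my own illustration, not from the linked Q & A) of how large this overhead can be: compare the full lm pipeline with its numerical core lm.fit on a pre-built design matrix.

    set.seed(1)
    d <- data.frame(y = rnorm(1e5), x = rnorm(1e5))
    X <- cbind(1, d$x)                                 ## design matrix built once, by hand
    system.time(for (i in 1:20) lm(y ~ x, data = d))   ## formula + model frame + model matrix + fit
    system.time(for (i in 1:20) lm.fit(X, d$y))        ## the least-squares fit alone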

Simple linear regression has clever computational tricks. As I demonstrated in Fast pairwise simple linear regression between variables in a data frame, the covariance method is very fast. Can we adapt it to group-by simple linear regression, via a group_by_simpleLM function?
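For reference, the covariance method for a single simple linear regression boils down to two lines. A minimal sketch (the function name simpleLM_cov is mine, not part of the linked answer):

    simpleLM_cov <- function (x, y) {
      beta  <- cov(x, y) / var(x)          ## slope
      alpha <- mean(y) - beta * mean(x)    ## intercept
      c(alpha = alpha, beta = beta)
    }
    ## e.g. simpleLM_cov(mtcars$wt, mtcars$mpg) agrees with coef(lm(mpg ~ wt, data = mtcars))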

Best Answer

We have to do this by writing compiled code. I will present it with Rcpp. Note that I am a C programmer and have been using R's conventional C interface; Rcpp is used here only to ease the handling of lists, strings and attributes, and to facilitate immediate testing in R. The code is largely written in C style, and macros from R's conventional C interface, like REAL and INTEGER, are still used. See the bottom of this answer for "group_by_simpleLM.cpp".

The R wrapper function group_by_simpleLM has four arguments:

    group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
    ##.... [TRUNCATED: see the appendix at the bottom for the full definition]

  • dat is a data frame. If you feed it a matrix or a list, it stops with an error message.
  • LHS is a character vector giving the names of the variables on the left-hand side of ~. Multiple LHS variables are supported.
  • RHS is a character vector giving the names of the variables on the right-hand side of ~. Only a single, non-factor RHS variable is allowed in simple linear regression. You can pass a vector of variables to RHS, but the function keeps only the first element (with a warning). If the variable cannot be found in dat (perhaps because you typed the name wrongly) or it is not numeric, an informative error message is given.
  • group is a character vector giving the name of the grouping variable. It is best to be a factor in dat; otherwise the function makes a fast coercion using match(group, unique(group)) and issues a warning. A factor with unused levels does no harm: group_by_simpleLM_cpp detects this and returns all-NaN for such levels. group can be NULL, in which case a single regression is done on all the data.

  • The workhorse function group_by_simpleLM_cpp returns a named list of matrices holding the regression results for each response. Each matrix is "wide", with nlevels(group) columns and 5 rows:
  • "alpha" (for the intercept);
  • "beta" (for the slope);
  • "beta.se" (for the standard error of the slope);
  • "r2" (for R-squared);
  • "df.resid" (for the residual degrees of freedom).

  • For a simple linear regression, these five statistics are sufficient to obtain other statistics (a small sketch follows).
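For example, the t statistic and the two-sided p-value of the slope follow from "beta", "beta.se" and "df.resid" alone. A hedged sketch, using the toy data from the Examples section below (m is just a placeholder name):

    m <- group_by_simpleLM(dat, "y1", "x", "f")$y1    ## one matrix of the returned list
    tval <- m["beta", ] / m["beta.se", ]              ## t statistic of the slope, per group
    pval <- 2 * pt(-abs(tval), df = m["df.resid", ])  ## two-sided p-value, as in summary.lm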

The function takes care of the rank-deficient case where a group has only one datum: the slope cannot be estimated and NaN is returned. Another special case is when a group has only two data points: the fit is then perfect and the standard error of the slope is 0.

The function is a fast method for nlme::lmList(LHS ~ RHS | group, dat, pool = FALSE) when group != NULL, and a fast method for lm(LHS ~ RHS, dat) when group = NULL (probably even faster than general_paired_simpleLM, as it is written in C).

    Notes:
  • Weighted regression is not handled, because the covariance method is invalid in that case.
  • No check for NA / NaN / Inf / -Inf in dat is made; the function breaks in their presence. A pre-cleaning sketch follows these notes.
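One possible pre-cleaning step (my own sketch, not part of the function; it assumes the toy dat from the Examples below, so adjust the column names to your data):

    num_used <- c("y1", "y2", "x")                        ## numeric variables passed to the fit
    keep <- Reduce(`&`, lapply(dat[num_used], is.finite)) & !is.na(dat[["f"]])
    dat_clean <- dat[keep, , drop = FALSE]                ## drops rows with NA / NaN / Inf / -Inf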


    Examples
    library(Rcpp)
    sourceCpp("group_by_simpleLM.cpp")

    ## a toy dataset
    set.seed(0)
    dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
                      f = gl(2, 5, labels = letters[1:2]),
                      g = sample(gl(2, 5, labels = LETTERS[1:2])))

    Group-by regression: a fast method for nlme::lmList
    group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
    #$y1
    # a b
    #alpha 0.820107094 -2.7164723
    #beta -0.009796302 0.8812007
    #beta.se 0.266690568 0.2090644
    #r2 0.000449565 0.8555330
    #df.resid 3.000000000 3.0000000
    #
    #$y2
    # a b
    #alpha 0.1304709 0.06996587
    #beta -0.1616069 -0.14685953
    #beta.se 0.2465047 0.24815024
    #r2 0.1253142 0.10454374
    #df.resid 3.0000000 3.00000000
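If a long/tidy layout is preferred, each returned matrix can simply be transposed. A small sketch (res is a placeholder name for the result above):

    res <- group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
    as.data.frame(t(res$y1))    ## one row per group, one column per statistic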

    fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)

    ## results for level "a"; use `fit[[2]]` to see results for level "b"
    lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
    #$`Response y1`
    #$`Response y1`$coefficients
    # Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 0.820107094 0.8845125 0.92718537 0.4222195
    #x -0.009796302 0.2666906 -0.03673284 0.9730056
    #
    #$`Response y1`$r.squared
    #[1] 0.000449565
    #
    #
    #$`Response y2`
    #$`Response y2`$coefficients
    # Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 0.1304709 0.8175638 0.1595850 0.8833471
    #x -0.1616069 0.2465047 -0.6555936 0.5588755
    #
    #$`Response y2`$r.squared
    #[1] 0.1253142

    Handling rank deficiency without failure
    ## with unused level "b"
    group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
    #$y1
    # a b
    #alpha 0.820107094 NaN
    #beta -0.009796302 NaN
    #beta.se 0.266690568 NaN
    #r2 0.000449565 NaN
    #df.resid 3.000000000 NaN

    ## rank-deficient case for level "b"
    group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
    #$y1
    # a b
    #alpha 0.820107094 -1.53995
    #beta -0.009796302 NaN
    #beta.se 0.266690568 NaN
    #r2 0.000449565 NaN
    #df.resid 3.000000000 0.00000

    Multiple grouping variables

    group_by_simpleLM cannot handle multiple grouping variables directly, but you can use interaction to create a single-factor variable first.
    dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
    group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
    #$y1
    # a:A b:A a:B b:B
    #alpha 1.4750325 -2.7684583 -1.6393289 -1.8513669
    #beta -0.2120782 0.9861509 0.7993313 0.4613999
    #beta.se 0.0000000 0.2098876 0.4946167 0.0000000
    #r2 1.0000000 0.9566642 0.7231188 1.0000000
    #df.resid 0.0000000 1.0000000 1.0000000 0.0000000
    #
    #$y2
    # a:A b:A a:B b:B
    #alpha 1.0292956 -0.22746944 -1.5096975 0.06876360
    #beta -0.2657021 -0.20650690 0.2547738 0.09172993
    #beta.se 0.0000000 0.01945569 0.3483856 0.00000000
    #r2 1.0000000 0.99120195 0.3484482 1.00000000
    #df.resid 0.0000000 1.00000000 1.0000000 0.00000000

    fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)

    ## note that the first group a:A only has two values, so df.resid = 0
    ## my method returns 0 standard error for the slope
    ## but lm or lmList would return NaN
    lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
    #$`Response y1`
    #$`Response y1`$coefficients
    # Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 1.4750325 NaN NaN NaN
    #x -0.2120782 NaN NaN NaN
    #
    #$`Response y1`$r.squared
    #[1] 1
    #
    #
    #$`Response y2`
    #$`Response y2`$coefficients
    # Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 1.0292956 NaN NaN NaN
    #x -0.2657021 NaN NaN NaN
    #
    #$`Response y2`$r.squared
    #[1] 1

    No grouping: a fast method for lm
    group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
    #$y1
    # ALL
    #alpha -0.9481826
    #beta 0.4357022
    #beta.se 0.2408162
    #r2 0.2903691
    #df.resid 8.0000000
    #
    #$y2
    # ALL
    #alpha 0.1002184
    #beta -0.1542332
    #beta.se 0.1514935
    #r2 0.1147012
    #df.resid 8.0000000
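A quick cross-check sketch (mine, not part of the original answer): the "Estimate" and "Std. Error" columns reported by summary.lm should reproduce the alpha, beta and beta.se values above.

    summary(lm(y1 ~ x, data = dat))$coefficients
    summary(lm(y2 ~ x, data = dat))$coefficients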

    Fast large simple linear regression
    set.seed(0L)
    nSubj <- 200e3
    nr <- 1e6
    DF <- data.frame(subject = gl(nSubj, 5),
                     day = 3:7,
                     y1 = runif(nr),
                     y2 = rpois(nr, 3),
                     y3 = rnorm(nr),
                     y4 = rnorm(nr, 1, 5))

    system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
    # user system elapsed
    # 0.200 0.016 0.219

    library(MatrixModels)
    system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
    # user system elapsed
    # 9.012 0.172 9.266
    group_by_simpleLM finishes all 4 responses almost instantly, while glm4 takes 9 seconds for a single response!

    Note that glm4 can break down in rank-deficient cases, while group_by_simpleLM does not.
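For completeness, one could also time the split-and-fit route on the same data. A sketch only; with 200,000 groups, expect it to be far slower than either method above, so run it with care:

    system.time(nlme::lmList(y1 ~ day | subject, data = DF, pool = FALSE))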

    Appendix: "group_by_simpleLM.cpp"
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {

    /* number of data and number of responses */
    int n = x.size(), k = Y.size(), n_groups = group_levels.size();

    /* set up result list */
    List result(k);
    List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
    int j; for (j = 0; j < k; j++) {
    NumericMatrix mat(5, n_groups);
    mat.attr("dimnames") = dimnames;
    result[j] = mat;
    }
    result.attr("names") = Y.attr("names");

    /* set up a vector to hold sample size for each group */
    size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));

    /*
    compute group offset: cumsum(group_offset)
    The offset is used in a different way when group is sorted or unsorted
    In the former case, it is the offset to real x, y values;
    In the latter case, it is the offset to ordering index indx
    */
    int *u = INTEGER(group), *u_end = u + n, i;
    if (n_groups > 1) {
    while (u < u_end) group_offset[*u++]++;
    for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
    } else {
    group_offset[1] = n;
    group_unsorted = 0;
    }

    /* local variables & pointers */
    double *xi, *xi_end; /* pointer to the 1st and the last x value */
    double *yi; /* pointer to the first y value */
    int gi; double inv_gi; /* sample size of the i-th group */
    double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
    double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
    double xiyi_cov; /* covariance between x and y values in the i-th group */
    double beta, r2; int df_resi;
    double *matij;

    /* additional storage and variables when group is unsorted */
    int *indx; double *xb, *xbi, dtmp;
    if (group_unsorted) {
    indx = (int *)malloc(n * sizeof(int));
    xb = (double *)malloc(n * sizeof(double)); // buffer x for caching
    R_orderVector1(indx, n, group, TRUE, FALSE); // Er, how is TRUE & FALSE recognized as Rboolean?
    }

    /* loop through groups */
    for (i = 0; i < n_groups; i++) {
    /* set group size gi */
    gi = group_offset[i + 1] - group_offset[i];
    /* special case for a factor level with no data */
    if (gi == 0) {
    for (j = 0; j < k; j++) {
    /* matrix column for write-back */
    matij = REAL(result[j]) + i * 5;
    matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
    matij[3] = R_NaN; matij[4] = R_NaN;
    }
    continue;
    }
    /* rank-deficient case */
    if (gi == 1) {
    gi = group_offset[i];
    if (group_unsorted) gi = indx[gi];
    for (j = 0; j < k; j++) {
    /* matrix column for write-back */
    matij = REAL(result[j]) + i * 5;
    matij[0] = REAL(Y[j])[gi];
    matij[1] = R_NaN; matij[2] = R_NaN;
    matij[3] = R_NaN; matij[4] = 0.0;
    }
    continue;
    }
    /* general case where a regression line can be estimated */
    inv_gi = 1 / (double)gi;
    /* compute mean & variance of x values in this group */
    xi_mean = 0.0; xi_var = 0.0;
    if (group_unsorted) {
    /* use u, u_end and xbi */
    xi = REAL(x);
    u = indx + group_offset[i]; /* offset acts on index */
    u_end = u + gi;
    xbi = xb + group_offset[i];
    for (; u < u_end; xbi++, u++) {
    dtmp = xi[*u];
    xi_mean += dtmp;
    xi_var += dtmp * dtmp;
    *xbi = dtmp;
    }
    } else {
    /* use xi and xi_end */
    xi = REAL(x) + group_offset[i]; /* offset acts on values */
    xi_end = xi + gi;
    for (; xi < xi_end; xi++) {
    xi_mean += *xi;
    xi_var += (*xi) * (*xi);
    }
    }
    xi_mean = xi_mean * inv_gi;
    xi_var = xi_var * inv_gi - xi_mean * xi_mean;
    /* loop through responses doing simple linear regression */
    for (j = 0; j < k; j++) {
    /* compute mean & variance of y values, as well its covariance with x values */
    yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
    if (group_unsorted) {
    xbi = xb + group_offset[i]; /* use buffered x values */
    yi = REAL(Y[j]);
    u = indx + group_offset[i]; /* offset acts on index */
    for (; u < u_end; u++, xbi++) {
    dtmp = yi[*u];
    yi_mean += dtmp;
    yi_var += dtmp * dtmp;
    xiyi_cov += dtmp * (*xbi);
    }
    } else {
    /* set xi and yi */
    xi = REAL(x) + group_offset[i]; /* offset acts on values */
    yi = REAL(Y[j]) + group_offset[i]; /* offset acts on values */
    for (; xi < xi_end; xi++, yi++) {
    yi_mean += *yi;
    yi_var += (*yi) * (*yi);
    xiyi_cov += (*yi) * (*xi);
    }
    }
    yi_mean = yi_mean * inv_gi;
    yi_var = yi_var * inv_gi - yi_mean * yi_mean;
    xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
    /* locate the right place to write back regression result */
    matij = REAL(result[j]) + i * 5 + 4;
    /* residual degree of freedom */
    df_resi = gi - 2; *matij-- = (double)df_resi;
    /* R-squared = squared correlation */
    r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
    /* standard error of regression slope */
    if (df_resi == 0) *matij-- = 0.0;
    else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
    /* regression slope */
    beta = xiyi_cov / xi_var; *matij-- = beta;
    /* regression intercept */
    *matij = yi_mean - beta * xi_mean;
    }
    }

    if (group_unsorted) {
    free(indx);
    free(xb);
    }
    free(group_offset);
    return result;
    }

    /*** R
    group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {

    ## basic input validation
    if (missing(dat)) stop("no data provided to 'dat'!")
    if (!is.data.frame(dat)) stop("'dat' must be a data frame!")

    if (missing(LHS)) stop("no 'LHS' provided!")
    if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")

    if (missing(RHS)) stop("no 'RHS' provided!")
    if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")

    if (!is.null(group)) {

    ## grouping variable provided: a fast method of `nlme::lmList`

    if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")

    ## ensure that group has length 1, is available in the data frame and is a factor
    if (length(group) > 1L) {
    warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
    group <- group[1L]
    }
    grp <- dat[[group]]
    if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))

    if (is.factor(grp)) {
    grp_levels <- levels(grp)
    } else {
    warning("grouping variable is not provided as a factor; fast coercion is made!")
    grp_levels <- unique(grp)
    grp <- match(grp, grp_levels)
    grp_levels <- as.character(grp_levels)
    }

    grp_unsorted <- .Internal(is.unsorted(grp, FALSE))

    } else {

    ## no grouping; a fast method of `lm`
    grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE

    }

    ## the RHS must have length 1, be available in the data frame and be numeric
    if (length(RHS) > 1L) {
    warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
    RHS <- RHS[1L]
    }
    x <- dat[[RHS]]
    if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
    if (!is.numeric(x) || is.factor(x)) {
    stop("RHS variable must be 'numeric' for simple linear regression!")
    }
    x <- as.numeric(x) ## just in case `x` is an integer

    ## check LHS variables
    nested <- match(RHS, LHS, nomatch = 0L)
    if (nested > 0L) {
    warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
    LHS <- LHS[-nested]
    }
    if (length(LHS) == 0L) stop("no usable LHS variables found!")
    missed <- !(LHS %in% names(dat))
    if (any(missed)) {
    warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
    LHS <- LHS[!missed]
    }
    if (length(LHS) == 0L) stop("no usable LHS variables found!")
    Y <- dat[LHS]
    invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
    if (any(invalid_LHS)) {
    warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
    Y <- Y[!invalid_LHS]
    }
    if (length(Y) == 0L) stop("no usable LHS variables found!")
    Y <- lapply(Y, as.numeric) ## just in case that we have integer variables in Y

    ## check for existence of NA, NaN, Inf and -Inf and drop them?

    ## use Rcpp
    group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
    }
    */

The original question, "r - Fast group-by simple linear regression", is on Stack Overflow: https://stackoverflow.com/questions/51996021/
