r - 是否有简单回归的快速估计(只有截距和斜率的回归线)？

转载作者：行者123 更新时间：2023-12-05 08:56:29

这个问题与机器学习特征选择过程有关。

我有一个很大的特征矩阵——列是主题的特征(行):

set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")

每个对象 (S) 在不同条件 (C) 下的响应是测量的，因此看起来像这样:

response.df <-
data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""),100))),
           C = rep(paste("C", 1:100, sep = ""), 10),
           response = rnorm(1000), stringsAsFactors = F)

所以我匹配response.df中的主题:

match.idx <- match(response.df$S, rownames(features.mat))

我正在寻找一种快速方法来计算每个特征和响应的单变量回归。

还有比这更快的吗？

fun <- function(f){
  fit <- lm(response.df$response ~ features.mat[match.idx,f])
  beta <- coef(summary(fit))
  data.frame(feature = colnames(features.mat)[f], effect = beta[2,1],
             p.val = beta[2,4], stringsAsFactors = F))
  }

res <- do.call(rbind, lapply(1:ncol(features.mat), fun))

我对边际提升感兴趣，即除了通过 mclapply 或 mclapply2 使用并行计算之外的方法。

最佳答案

我会提供一个轻量级的玩具例程来估计一个简单的回归模型:y ~ x，即只有截距和斜率的回归线。 正如将会看到的那样，这比 lm + summary.lm 快 36 倍。

## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  slope <- xty / xtx
  rc <- yc - xc * slope
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  slope_se <- sqrt(sigma2 / xtx)
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

让我们来做个测试:

simplelm(x, y)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

另一方面，lm + summary.lm 给出:

coef(summary(lm(y ~ x)))

#             Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 0.1154549 0.01373051  8.408633 5.350248e-11
#x           0.2656737 0.02279663 11.654079 1.337380e-15

所以结果匹配。如果您需要 R 平方和调整后的 R 平方，也可以轻松计算。

让我们有一个基准:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))

#Unit: microseconds
#                     expr      min       lq       mean   median       uq
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46
#      max neval cld
# 114662.2   100   b
#   3409.6   100  a

神圣!!!我们有 36 倍的提升!

Remark-1(求解正规方程)

simplelm 基于通过 Cholesky 分解求解正规方程。但由于它很简单，因此不涉及实际的矩阵计算。如果我们需要使用多个协变量进行回归，我们可以使用 lm.chol defined in my this answer .

也可以使用 LU 分解来求解正规方程。我不会谈这个，但如果你有兴趣，这里是:Solving normal equation gives different coefficients from using lm? .

Remark-2(通过 `cor.test` 替代)

simplelm 是对我的回答 Monte Carlo simulation of correlation between two Brownian motion (continuous random walk) 中的 fastsim 的扩展.另一种方法是基于 cor.test。它也比 lm + summary.lm 快得多，但如该答案所示，它比我上面的建议慢。

备注 3(通过 QR 方法替代)

基于 QR 的方法也是可能的，在这种情况下我们要使用 .lm.fit，qr.default 的轻量级包装器，qr C 级的 .coef、qr.fitted 和 qr.resid。以下是我们如何将此选项添加到我们的 simplelm:

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y, QR = FALSE) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting intercept free model: yc ~ xc + 0
  if (QR) {
    fit <- .lm.fit(matrix(xc), yc)
    slope <- fit$coefficients
    rc <- fit$residuals
    } else {
    xty <- c(crossprod(xc, yc))
    xtx <- c(crossprod(xc))
    slope <- xty / xtx
    rc <- yc - xc * slope
    }
  ## Pearson estimate of residual standard error
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  if (QR) {
    slope_se <- sqrt(sigma2) / abs(fit$qr[1])
    } else {
    slope_se <- sqrt(sigma2 / xtx)
    }
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

对于我们的玩具数据，QR 方法和 Cholesky 方法都给出相同的结果:

set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

simplelm(x, y, TRUE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15 

simplelm(x, y, FALSE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

已知 QR 方法比 Cholesky 方法慢 2 ~ 3 倍(阅读我的回答 Why the built-in lm function is so slow in R? 以获得详细解释)。这是一个快速检查:

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))

#Unit: microseconds
#                 expr    min     lq      mean median     uq     max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92   100   b
#       simplelm(x, y) 238.32 292.02  441.9292 310.44 319.32 3515.08   100  a

的确如此，908/310 = 2.93。

Remark-4(GLM 的简单回归)

如果我们继续使用 GLM，还有一个基于 glm.fit 的快速、轻量级版本。你可以看我的回答R loop help: leave out one observation and run glm one variable at a time并使用那里定义的函数 f。目前 f 是为逻辑回归定制的，但我们可以很容易地将它推广到其他响应。

关于r - 是否有简单回归的快速估计(只有截距和斜率的回归线)？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/40141738/

文章推荐： prolog - 如何将跟踪输出重定向到文件

文章推荐： Python 正则表达式 findall 或

文章推荐： google-publisher-tag - 注册到事件的 Google 发布商代码

r - 具有多个组的 3D 密度图(回归线)
我想创建这样的情节: 这里我们有两个变量 X = midyear 和 Y = yearend。我想为 Y 的 X 的每个级别创建密度图。我可以做到这一点，但看起来并不像我想象的那样，特别是情节和线条
python - 将公式应用于 pandas 行/回归线
我正在尝试在数据框的行上应用公式来获取行中数字的趋势。下面的示例在使用 .apply 的部分之前一直有效。 df = pd.DataFrame(np.random.randn(10, 4), col

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

r - 是否有简单回归的快速估计(只有截距和斜率的回归线)？

Remark-1(求解正规方程)

Remark-2(通过 `cor.test` 替代)

备注 3(通过 QR 方法替代)

Remark-4(GLM 的简单回归)

首页

博学

6Ren·AI

商城

r - 是否有简单回归的快速估计(只有截距和斜率的回归线)？

Remark-1(求解正规方程)

Remark-2(通过 cor.test 替代)

备注 3(通过 QR 方法替代)

Remark-4(GLM 的简单回归)

Remark-2(通过 `cor.test` 替代)