gpt4 book ai didi

r - `lm` : how to get prediction variance of sum of predicted values 的线性模型

转载 作者:行者123 更新时间:2023-12-04 09:42:56 28 4
gpt4 key购买 nike

我正在对具有多个预测变量的线性模型的预测值求和,如下例所示,并希望计算该总和的组合方差、标准误差和可能的置信区间。

lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)

假设我有一组 Girths :
newdat <- list(Girth = c(10,12,14,16)

我想预测的总数 Volume :
pr <- predict(lm.tree, newdat, se.fit = TRUE)
total <- sum(pr$fit)
# [1] 111.512

如何获得 total 的方差?

类似的问题是 here (for GAMs) ,但我不知道如何处理 vcov(lm.trees) .我将不胜感激提供该方法的引用。

最佳答案

您需要获得完整的方差-协方差矩阵,然后将其所有元素相加。 这是小证明:

enter image description here

这里的证明是使用另一个定理,你可以从 Covariance-wikipedia 中找到。 :

enter image description here

具体来说,我们采用的线性变换是一个全为 1 的列矩阵。计算得到的二次形式as following , 与所有 x_ix_j是 1。

enter image description here

设置

## your model
lm.tree <- lm(Volume ~ poly(Girth, 2), data = trees)

## newdata (a data frame)
newdat <- data.frame(Girth = c(10, 12, 14, 16))

重新实现 predict.lm计算方差-协方差矩阵

How does predict.lm() compute confidence interval and prediction interval?对于如何 predict.lm作品。下面的小功能 lm_predict模仿它的作用,除了
  • 它不构造置信度或预测区间(但构造非常简单,如该问答中所述);
  • 如果diag = FALSE,它可以计算预测值的完整方差-协方差矩阵;
  • 它返回方差(对于预测值和残差),而不是标准误差;
  • 做不到type = "terms" ;它只预测响应变量。

  • lm_predict <- function (lmObject, newdata, diag = TRUE) {
    ## input checking
    if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
    ## extract "terms" object from the fitted model, but delete response variable
    tm <- delete.response(terms(lmObject))
    ## linear predictor matrix
    Xp <- model.matrix(tm, newdata)
    ## predicted values by direct matrix-vector multiplication
    pred <- c(Xp %*% coef(lmObject))
    ## efficiently form the complete variance-covariance matrix
    QR <- lmObject$qr ## qr object of fitted model
    piv <- QR$pivot ## pivoting index
    r <- QR$rank ## model rank / numeric rank
    if (is.unsorted(piv)) {
    ## pivoting has been done
    B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
    } else {
    ## no pivoting is done
    B <- forwardsolve(t(QR$qr), t(Xp), r)
    }
    ## residual variance
    sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
    if (diag) {
    ## return point-wise prediction variance
    VCOV <- colSums(B ^ 2) * sig2
    } else {
    ## return full variance-covariance matrix of predicted values
    VCOV <- crossprod(B) * sig2
    }
    list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
    }

    我们可以将其输出与 predict.lm 的输出进行比较。 :
    predict.lm(lm.tree, newdat, se.fit = TRUE)
    #$fit
    # 1 2 3 4
    #15.31863 22.33400 31.38568 42.47365
    #
    #$se.fit
    # 1 2 3 4
    #0.9435197 0.7327569 0.8550646 0.8852284
    #
    #$df
    #[1] 28
    #
    #$residual.scale
    #[1] 3.334785

    lm_predict(lm.tree, newdat)
    #$fit
    #[1] 15.31863 22.33400 31.38568 42.47365
    #
    #$var.fit ## the square of `se.fit`
    #[1] 0.8902294 0.5369327 0.7311355 0.7836294
    #
    #$df
    #[1] 28
    #
    #$residual.var ## the square of `residual.scale`
    #[1] 11.12079

    特别是:
    oo <- lm_predict(lm.tree, newdat, FALSE)
    oo
    #$fit
    #[1] 15.31863 22.33400 31.38568 42.47365
    #
    #$var.fit
    # [,1] [,2] [,3] [,4]
    #[1,] 0.89022938 0.3846809 0.04967582 -0.1147858
    #[2,] 0.38468089 0.5369327 0.52828797 0.3587467
    #[3,] 0.04967582 0.5282880 0.73113553 0.6582185
    #[4,] -0.11478583 0.3587467 0.65821848 0.7836294
    #
    #$df
    #[1] 28
    #
    #$residual.var
    #[1] 11.12079

    请注意,方差-协方差矩阵不是以天真的方式计算的: Xp %*% vcov(lmObject) % t(Xp) ,这很慢。

    聚合(总和)

    在您的情况下,聚合操作是 oo$fit 中所有值的总和.这种聚合的均值和方差是
    sum_mean <- sum(oo$fit)  ## mean of the sum
    # 111.512

    sum_variance <- sum(oo$var.fit) ## variance of the sum
    # 6.671575

    您可以通过使用 t 分布和模型中的残差自由度进一步构建此聚合值的置信区间 (CI)。
    alpha <- 0.95
    Qt <- c(-1, 1) * qt((1 - alpha) / 2, lm.tree$df.residual, lower.tail = FALSE)
    #[1] -2.048407 2.048407

    ## %95 CI
    sum_mean + Qt * sqrt(sum_variance)
    #[1] 106.2210 116.8029

    构建预测区间(PI)需要进一步考虑残差方差。
    ## adjusted variance-covariance matrix
    VCOV_adj <- with(oo, var.fit + diag(residual.var, nrow(var.fit)))

    ## adjusted variance for the aggregation
    sum_variance_adj <- sum(VCOV_adj) ## adjusted variance of the sum

    ## 95% PI
    sum_mean + Qt * sqrt(sum_variance_adj)
    #[1] 96.86122 126.16268

    聚合(一般)

    一般的聚合操作可以是 oo$fit 的线性组合:
    w[1] * fit[1] + w[2] * fit[2] + w[3] * fit[3] + ...

    例如,求和运算的所有权重都为 1;平均操作的所有权重均为 0.25(在 4 个数据的情况下)。这是采用权重向量、显着性水平以及 lm_predict 返回的函数的函数生成聚合的统计信息。
    agg_pred <- function (w, predObject, alpha = 0.95) {
    ## input checing
    if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
    if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
    ## mean of the aggregation
    agg_mean <- c(crossprod(predObject$fit, w))
    ## variance of the aggregation
    agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
    ## adjusted variance-covariance matrix
    VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
    ## adjusted variance of the aggregation
    agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
    ## t-distribution quantiles
    Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
    ## names of CI and PI
    NAME <- c("lower", "upper")
    ## CI
    CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
    ## PI
    PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
    ## return
    list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
    }

    对之前的求和运算的快速测试:
    agg_pred(rep(1, length(oo$fit)), oo)
    #$mean
    #[1] 111.512
    #
    #$var
    #[1] 6.671575
    #
    #$CI
    # lower upper
    #106.2210 116.8029
    #
    #$PI
    # lower upper
    # 96.86122 126.16268

    以及对平均操作的快速测试:
    agg_pred(rep(1, length(oo$fit)) / length(oo$fit), oo)
    #$mean
    #[1] 27.87799
    #
    #$var
    #[1] 0.4169734
    #
    #$CI
    # lower upper
    #26.55526 29.20072
    #
    #$PI
    # lower upper
    #24.21531 31.54067

    评论

    此答案已改进,为 Linear regression with `lm()`: prediction interval for aggregated predicted values 提供易于使用的功能.

    升级(大数据)

    This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix?



    感谢 Linear regression with `lm()`: prediction interval for aggregated predicted values 的 OP对于这个非常有用的评论。是的,这是可能的,而且(显着)计算成本也更低。目前, lm_predict形成方差-协方差如下:

    enter image description here
    agg_pred将预测方差(用于构建 CI)计算为二次形式: w'(B'B)w ,以及作为另一个二次形式的预测方差(用于构造 PI) w'(B'B + D)w ,其中 D是残差方差的对角矩阵。显然,如果我们融合这两个函数,我们就有了更好的计算策略:

    enter image description here
    B的计算和 B'B被避免;我们已将所有矩阵-矩阵乘法替换为矩阵-向量乘法。 B 没有内存存储和 B'B ;仅适用于 u这只是一个向量。这是融合的实现。
    ## this function requires neither `lm_predict` nor `agg_pred`
    fast_agg_pred <- function (w, lmObject, newdata, alpha = 0.95) {
    ## input checking
    if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
    if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
    if (length(w) != nrow(newdata)) stop("length(w) does not match nrow(newdata)")
    ## extract "terms" object from the fitted model, but delete response variable
    tm <- delete.response(terms(lmObject))
    ## linear predictor matrix
    Xp <- model.matrix(tm, newdata)
    ## predicted values by direct matrix-vector multiplication
    pred <- c(Xp %*% coef(lmObject))
    ## mean of the aggregation
    agg_mean <- c(crossprod(pred, w))
    ## residual variance
    sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
    ## efficiently compute variance of the aggregation without matrix-matrix computations
    QR <- lmObject$qr ## qr object of fitted model
    piv <- QR$pivot ## pivoting index
    r <- QR$rank ## model rank / numeric rank
    u <- forwardsolve(t(QR$qr), c(crossprod(Xp, w))[piv], r)
    agg_variance <- c(crossprod(u)) * sig2
    ## adjusted variance of the aggregation
    agg_variance_adj <- agg_variance + c(crossprod(w)) * sig2
    ## t-distribution quantiles
    Qt <- c(-1, 1) * qt((1 - alpha) / 2, lmObject$df.residual, lower.tail = FALSE)
    ## names of CI and PI
    NAME <- c("lower", "upper")
    ## CI
    CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
    ## PI
    PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
    ## return
    list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
    }

    让我们快速测试一下。
    ## sum opeartion
    fast_agg_pred(rep(1, nrow(newdat)), lm.tree, newdat)
    #$mean
    #[1] 111.512
    #
    #$var
    #[1] 6.671575
    #
    #$CI
    # lower upper
    #106.2210 116.8029
    #
    #$PI
    # lower upper
    # 96.86122 126.16268

    ## average operation
    fast_agg_pred(rep(1, nrow(newdat)) / nrow(newdat), lm.tree, newdat)
    #$mean
    #[1] 27.87799
    #
    #$var
    #[1] 0.4169734
    #
    #$CI
    # lower upper
    #26.55526 29.20072
    #
    #$PI
    # lower upper
    #24.21531 31.54067

    是的,答案是正确的!

    关于r - `lm` : how to get prediction variance of sum of predicted values 的线性模型,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39337862/

    28 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com