r - Linear model diagnostics for Bayesian models using rstan


I'm looking for an efficient way to identify data points that have an outsized influence on the parameters of a linear model. This is straightforward for an ordinary linear model, but I'm not sure how to do it with a Bayesian linear model.

Here is one approach with ordinary linear models, where we compute the Cook's distance for each data point and draw diagnostic plots that include the Cook's distances:

# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# make linear models for each type of species
xy <-
  iris %>%
  nest(-Species) %>%
  mutate(model = map(data,
                     ~lm(Sepal.Length ~ Petal.Width,
                         data = .)),
         fit = map(model, augment))

Now we have a data frame with nested list columns; the model column contains a linear model for each species:
> xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3 virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
The broom::augment function lets us add the Cook's distance values for each data point to this data frame, and we can inspect them like this:
# inspect Cook's distance values
xy %>%
unnest(fit) %>%
arrange(desc(.cooksd))

# A tibble: 150 × 10
Species Sepal.Length Petal.Width .fitted .se.fit .resid .hat .sigma .cooksd
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 versicolor 5.9 1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2 setosa 5.0 0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3 virginica 4.9 1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4 setosa 5.7 0.4 5.149247 0.08625887 0.5507534 0.06357957 0.3355980 0.09396588
5 setosa 4.3 0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6 virginica 5.8 2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7 virginica 7.2 1.6 6.310746 0.16207266 0.8892538 0.06909799 0.6084108 0.08293253
8 versicolor 4.9 1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9 setosa 5.8 0.2 4.963212 0.05287342 0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor 6.0 1.0 5.471005 0.11998077 0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>

With the autoplot method, we can make informative diagnostic plots that display the Cook's distance values and help us quickly identify data points with an outsized influence on the model parameters:
# plot model diagnostics
library(ggplot2)
library(ggfortify)

diagnostic_plots_df <-
  xy %>%
  mutate(diagnostic_plots = map(model,
                                ~autoplot(.,
                                          which = 1:6,
                                          ncol = 3,
                                          label.size = 3)))

Here is just one of the plots produced:
> diagnostic_plots_df[[1]]

[image: diagnostic plots for one model]

Now, with a Bayesian linear model, we can similarly compute a linear model for each group in the data frame:
# Bayesian linear model diagnostics
library(rstanarm)

bayes_xy <-
  iris %>%
  nest(-Species) %>%
  mutate(model = map(data,
                     ~stan_lm(Sepal.Length ~ Petal.Width,
                              data = .,
                              prior = NULL,
                              chains = 1,
                              cores = 2,
                              seed = 1)),
         fit = map(model, augment))

> bayes_xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3 virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>

But the broom::augment method doesn't give us anything like the Cook's distance values:
# inspect fit diagnostics
bayes_xy %>% unnest(fit)

# A tibble: 150 × 6
Species Sepal.Length Petal.Width .fitted .se.fit .resid
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 0.2 4.963968 0.06020298 0.13482025
2 setosa 4.9 0.2 4.963968 0.06020298 -0.06517975
3 setosa 4.7 0.2 4.963968 0.06020298 -0.26517975
4 setosa 4.6 0.2 4.963968 0.06020298 -0.36517975
5 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
6 setosa 5.4 0.4 5.151501 0.11299956 0.21818386
7 setosa 4.6 0.3 5.057734 0.05951488 -0.47349794
8 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
9 setosa 4.4 0.2 4.963968 0.06020298 -0.56517975
10 setosa 4.9 0.1 4.870201 0.11408783 0.04313845
# ... with 140 more rows

And there is no autoplot method:
# plot model diagnostics
bayes_diagnostic_plots_df <-
  bayes_xy %>%
  mutate(diagnostic_plots = map(model,
                                ~autoplot(.,
                                          which = 1:6,
                                          ncol = 3,
                                          label.size = 3)))

# Error, there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])

# This is quite interesting, but nothing at the level of specific data points

Some of the academic literature discusses methods such as model perturbations, phi-divergence, Cook's posterior mode distance, Cook's posterior mean distance, Kullback-Leibler divergence, etc. But I can't find anywhere these have been explored with R code, and I'm stuck.

There is an unanswered question on this topic at Cross Validated. I'm posting here because I'm looking for ideas about writing code to compute influence statistics (rather than advice about statistical theory and methods, which should go to that other question).

How can I get something like a Cook's distance measure from the output of rstanarm::stan_lm?

Best answer

Aki Vehtari put it best in this post:

The difference between lppd_i and loo_i has been used as a sensitivity measure (see, e.g., Gelfand et al 1992). Pareto shape parameter estimate k is likely to be large if the difference between lppd_i and loo_i is large. It's not yet clear to me whether the Pareto shape parameter estimate k would be better than lppd_i-loo_i, but at least we know that estimate for lppd_i-loo_i is too small if k is close to 1 or larger, so it might be better to look at k. In stack example with normal model, k for one observation is large, but with student-t model k is smaller. Normal model is same as student-t model, but with very strong prior on degrees of freedom. So it's not just about having strong prior or more shrinkage, but having a model which can describe the observations well. With increased shrinkage and non-robust observation model, the one observation could still be surprising. Naturally it's not always the best solution to change to a more robust observation model allowing "outliers". Instead it might be better to make the regression function more nonlinear (that is having a less strong prior), or transform covariates, or add more covariates. So I do recommend looking at Pareto shape parameter values, but I don't recommend to increase shrinkage if the values are large.



You can get the Pareto shape parameter estimates k from the $pareto_k element of the list produced by the loo function in the loo package, which is re-exported by the rstanarm package. If a value is greater than 0.7 (the default), the loo function will suggest that you refit the model omitting that observation, because the posterior distribution is probably too sensitive to that observation to satisfy the assumption of the LOOIC that each observation has a negligible influence on the posterior distribution.
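Mapped over the nested models from the question, this might look like the following sketch. This is an illustration only, assuming the bayes_xy data frame from above is in the workspace; note also that in newer versions of the loo package the estimates live in loo_obj$diagnostics$pareto_k, or can be extracted with loo::pareto_k_values(), rather than a top-level $pareto_k element.

```r
# compute LOO for each group's model and pull out the Pareto k estimates
library(rstanarm)
library(dplyr)
library(purrr)

bayes_xy_loo <-
  bayes_xy %>%
  mutate(loo_obj  = map(model, ~loo(.x)),
         pareto_k = map(loo_obj, ~loo::pareto_k_values(.x)))

# indices of observations with k > 0.7, per group: these are the
# data points whose influence on the posterior deserves a closer look
high_k <- map(bayes_xy_loo$pareto_k, ~which(.x > 0.7))
```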

In the OP's case, only the seventh observation has a Pareto shape parameter estimate slightly greater than 0.5, so that observation probably does not have an extreme effect on the posterior. But you would definitely want to investigate any observations with values greater than 1.0.

You can also call the plot method on a loo object, especially with the non-default option label_points = TRUE, to visualize the Pareto shape parameter estimates.
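For example (a sketch, again assuming the bayes_xy models from the question):

```r
# LOO for the first group's model, then plot the Pareto k estimate for
# each observation, with points labelled by observation number so that
# influential data points can be identified
loo_1 <- loo(bayes_xy$model[[1]])
plot(loo_1, label_points = TRUE)
```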

For "r - Linear model diagnostics for Bayesian models using rstan", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39578834/
