I'm trying to obtain the predicted probability of the source of my data (coded as 0 or 1, for source A and source B) from a glmer model.
Using example data:
library(lme4)

set.seed(123)
n <- 7052
Df <- data.frame(
  # sample() rescales prob weights internally, so they need not sum to 1
  source = sample(c(0, 1), n, replace = TRUE,
                  prob = c(0.719, 0.221)),
  Response.number = sample(1:20, n, replace = TRUE),
  Item.number = sample(1:40, n, replace = TRUE),
  Ps.number = sample(1:40, n, replace = TRUE)
)
# Intercept-only model: random intercepts only, no fixed predictors
Model1 <- glmer(source ~ (1|Response.number/Item.number) +
                  (1|Ps.number),
                data = Df, family = binomial,
                control = glmerControl(optimizer = "bobyqa"))
As per https://sebastiansauer.github.io/convert_logit2prob/, the hand calculation exp(b)/(1 + exp(b)) produces the same predicted probability as the function below:
probability <- predict(Model1, type="response")
mean(probability)
I tried it with multiple types of practice data and this generally works (in the above example, it's 0.23199). However, when I use my actual data, I'm getting a slightly different value from the predict function (0.59) than by hand (0.57). I know it's not a lot but this discrepancy doesn't occur when I use any other data.
head(Df_real)
source Response.number Item.number Ps.number
0 1 1 1
0 2 1 1
1 3 1 1
1 4 1 1
0 5 1 1
0 6 1 1
0 1 2 1
0 2 2 1
1 3 2 1
1 4 2 1
0 5 2 1
0 6 2 1
0 1 1 2
0 2 1 2
1 3 1 2
1 4 1 2
0 5 1 2
0 6 1 2
0 1 2 2
0 2 2 2
1 3 2 2
1 4 2 2
0 5 2 2
etc.
The data is nested, that is, there is roughly the same number of participants for each value of response, the same number of responses for each value of item, etc. Can this be the source of the discrepancy? If so, how should I deal with it? Is the predict() function appropriate?
We probably need to see your whole data set, or at least a subset that will allow us to repeat some computation exactly. plogis(x) does the same thing as exp(x)/(1+exp(x)) and is likely to be slightly more reliable in cases of extremely large or small x.
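A minimal sketch of the reliability point, using only base R. The helper name inv_logit_naive and the input values are illustrative assumptions, chosen so that exp() overflows:

```r
# Hand-rolled inverse logit: breaks once exp(x) exceeds the double range
inv_logit_naive <- function(x) exp(x) / (1 + exp(x))

x <- c(-800, 0, 800)
naive  <- inv_logit_naive(x)  # last element is Inf / Inf, i.e. NaN
stable <- plogis(x)           # stays within [0, 1] for every input

print(naive)
print(stable)
```

For logits of ordinary size the two agree to floating point precision, so this only matters for extreme linear predictors.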
You haven't really shown us what you are doing "by hand". What x value are you using? How are you getting a single number from predict?
@BenBolker I presumed the OP was using b as the intercept of the model (there are no fixed effect variables in the formula).
sorry, it's a mean of predicted values, corrected now. x in the formula you wrote is the intercept of the model, that's right
When you run predict on a glmer model, it uses the variables present in your original data (including the random effects) to estimate the probability for each row, so predict will not return a vector of values that are all equal to the single value you get by running exp(b)/(1 + exp(b)) on the fixed effect coefficient.
To see this, let's try passing a little data frame of the random effect variables to the newdata argument of predict:
predict(Model1, newdata = data.frame(Item.number = 1,
Response.number = c(1, 2),
Ps.number = 1), type = 'response')
#> 1 2
#> 0.2261900 0.2405297
Since you don't have any fixed effects in your model, the overall probability (accounting for the random effects) would simply be:
b <- fixef(Model1)
exp(b)/(1 + exp(b))
#> (Intercept)
#> 0.2319048
As Ben Bolker points out in the comments, this is not the same as the raw proportion in the data, because back-transformed estimates from GLMMs need a bias adjustment to match the marginal mean. He also points out that we can remove the random effects from predict using re.form = NA, which will give you the same value as the transformed intercept:
mean(predict(Model1, type= 'response', re.form = NA)) == plogis(fixef(Model1))
#> (Intercept)
#> TRUE
So it really depends on what you want to predict, i.e. whether you want the random effects taken into account or not. If you do, you can use predict; otherwise you can hand calculate from the fixed effects or use re.form = NA inside predict.
As a side note, the base R function plogis is probably the easiest way to convert log odds to probability, and it clearly works here: we can see that using type = "response" is equivalent to plogis(predict(Model1, type = "link")):
all(
plogis(predict(Model1, type = "link")) == predict(Model1, type = "response")
)
#> [1] TRUE
Calculating it by hand is fine, though you will get very small floating point differences:
b <- predict(Model1, type = "link")
hist(exp(b)/(1 + exp(b)) - predict(Model1, type = 'response'))
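The size of those differences can be checked without the fitted model. This is a sketch on a made-up grid of logits (eta is an illustrative stand-in for the link-scale predictions); it compares the two calculations with all.equal(), which applies a numerical tolerance, rather than with ==:

```r
# Illustrative grid of log odds standing in for link-scale predictions
eta <- seq(-5, 5, length.out = 1001)

by_hand  <- exp(eta) / (1 + exp(eta))
built_in <- plogis(eta)

# Largest absolute discrepancy between the two calculations
max_diff <- max(abs(by_hand - built_in))
print(max_diff)

# Tolerance-based comparison, safer than == for floating point values
same <- isTRUE(all.equal(by_hand, built_in))
print(same)
```

The same tolerance-based comparison can stand in for any of the == checks on probabilities in this answer.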
So the smart way to hand calculate the overall probability from your model is
plogis(fixef(Model1))
#> (Intercept)
#> 0.2319048
I think there's an important point that you might be missing when comparing the means of the data with the means of the predictions. As @AllanCameron's comment notes, plogis(mean(predict(model))) is not the same as mean(plogis(predict(model))) (this is Jensen's inequality).
library(lme4)
library(emmeans)
set.seed(123)
n <- 7052
Df <- data.frame(
Response.number = sample(1:20, n, replace = TRUE),
Item.number = sample(1:40, n, replace = TRUE),
Ps.number = sample(1:40, n, replace = TRUE)
)
Df$source <- simulate(~(1|Response.number/Item.number) + (1|Ps.number),
family = binomial,
newdata = Df,
newparams = list(beta = qlogis(0.7), theta = c(1, 1, 1)))[[1]]
fit <- glmer(source ~(1|Response.number/Item.number) + (1|Ps.number),
family = binomial,
data = Df)
mean(Df$source) ## 0.6498866
(p1 <- predict(fit, newdata = data.frame(dummy = 1), re.form = NA)) ## 0.9280772
plogis(p1) ## 0.716685
(p2 <- predict(fit, newdata = data.frame(dummy = 1),
re.form = NA, type = "response")) ## 0.716685
emmeans(fit, ~ 1)
## 1 emmean SE df asymp.LCL asymp.UCL
## overall 0.928 0.263 Inf 0.412 1.44
emmeans(fit, ~ 1, type = "response")
## 1 prob SE df asymp.LCL asymp.UCL
## overall 0.717 0.0535 Inf 0.602 0.809
From the emmeans vignette:
vars <- sapply(VarCorr(fit), c)
total.SD <- sqrt(sum(vars^2))
emmeans(fit, ~ 1, type = "response", bias.adj = TRUE,
sigma = total.SD)
## 1 prob SE df asymp.LCL asymp.UCL
## overall 0.614 0.0398 Inf 0.545 0.698
The bias correction isn't exact (it uses a delta method approximation) so that's not quite right, but it's closer.
This is a little better:
library(logitnorm)
momentsLogitnorm(mu = fixef(fit), sigma = total.SD)
## mean var
## 0.65790176 0.06472473
Or:
mean(predict(fit, type = "response")) ## 0.6500409
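The gap between the back-transformed intercept and the marginal mean can also be reproduced without fitting a model. The sketch below assumes a logit-scale intercept of qlogis(0.7) and three independent random intercepts with variance 1 each (the values passed to simulate() above); the variable names are illustrative, and the total SD is taken as the square root of the summed variances.

```r
set.seed(1)
mu    <- qlogis(0.7)        # intercept on the logit scale
vars  <- c(1, 1, 1)         # assumed variance of each random intercept
sigma <- sqrt(sum(vars))    # independent effects: variances add, then take the root

# Monte Carlo draws of the linear predictor across groups
eta <- rnorm(1e5, mean = mu, sd = sigma)
conditional <- plogis(mean(eta))   # back-transform of the average logit
marginal_mc <- mean(plogis(eta))   # average of the back-transformed logits

# Same marginal mean by numerical integration over the normal density
marginal_int <- integrate(function(u) plogis(mu + sigma * u) * dnorm(u),
                          lower = -Inf, upper = Inf)$value

print(c(conditional = conditional, monte_carlo = marginal_mc,
        integral = marginal_int))
```

Because the intercept is above zero, averaging on the probability scale pulls the result toward 0.5, which is the same direction as the gap between plogis(p1) and mean(Df$source) above.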
This is a good answer, but there is a very important distinction to make. If we run a GLM (no random effect), the mean of the data will be the same as the back-transformed intercept: set.seed(101); x <- rbinom(100, size = 1, prob = 0.2); g <- glm(x ~ 1, family = binomial); all.equal(unname(plogis(coef(g))), mean(x)). However, this is not true for GLMMs; e.g. see cran.r-project.org/web/packages/emmeans/vignettes/…
Also, if you want random effects ignored, you can use predict(model, re.form = NA)
@BenBolker I think then that this is essentially the answer that the OP was looking for. mean(predict(Model1, type= 'response', re.form = NA)) is the same as plogis(fixef(Model1)).
Thanks! The "re.form = NA" part is what makes the difference between R and my hand calculations - but I don't think I want random effects ignored, they are the only effects I have in the model. So is it ok to use the value from R even though it doesn't match hand calculation and attribute the difference to random effects?
@Agata what difference though? The mean of the probability output given by predict?