predict() function produces different values than hand calculation in a glmer

Reposted by: bug小助手 | Updated: 2023-10-26 20:40:12

I'm trying to obtain the predicted probability of the source of my data (coded as 0 or 1, for source A and source B) from a glmer model.
Using example data:

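library(lme4)

set.seed(123)
n <- 7052
Df <- data.frame(
  # sample() rescales prob internally, so the weights need not sum to 1
  source = sample(c(0, 1), n, replace = TRUE,
                  prob = c(0.719, 0.221)),
  Response.number = sample(1:20, n, replace = TRUE),
  Item.number = sample(1:40, n, replace = TRUE),
  Ps.number = sample(1:40, n, replace = TRUE)
)

# Intercept-only model: random intercepts for items nested in responses,
# plus a crossed random intercept for participants
Model1 <- glmer(source ~ (1 | Response.number/Item.number) +
                  (1 | Ps.number),
                data = Df, family = binomial,
                control = glmerControl(optimizer = "bobyqa"))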

As per https://sebastiansauer.github.io/convert_logit2prob/, the hand calculation exp(b)/(1 + exp(b)) should produce the same predicted probability as the function below:

probability <- predict(Model1, type="response")
mean(probability)

I tried it with multiple types of practice data and this generally works (in the above example, it's 0.23199). However, when I use my actual data, I'm getting a slightly different value from the predict function (0.59) than by hand (0.57). I know it's not a lot but this discrepancy doesn't occur when I use any other data.

head(Df_real)
      source    Response.number  Item.number  Ps.number
           0               1         1         1
           0               2         1         1
           1               3         1         1
           1               4         1         1
           0               5         1         1
           0               6         1         1
           0               1         2         1
           0               2         2         1
           1               3         2         1
           1               4         2         1
           0               5         2         1
           0               6         2         1
           0               1         1         2
           0               2         1         2
           1               3         1         2
           1               4         1         2
           0               5         1         2
           0               6         1         2
           0               1         2         2
           0               2         2         2
           1               3         2         2
           1               4         2         2
           0               5         2         2

etc.

The data are nested: there is roughly the same number of participants for each value of response, the same number of responses for each value of item, etc. Could this be the source of the discrepancy? If so, how should I deal with it? Is the predict() function appropriate?

Comments

We probably need to see your whole data set, or at least a subset that will allow us to repeat some computation exactly. plogis(x) does the same thing as exp(x)/(1+exp(x)) and is likely to be slightly more reliable in cases of extremely large or small x.
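The point about extreme values can be seen directly in base R. This is a minimal sketch; the helper name inv_logit_hand and the input values are arbitrary illustrations, not taken from the question's data:

```r
# Hand formula for the inverse logit
inv_logit_hand <- function(x) exp(x) / (1 + exp(x))

x <- c(-800, -2, 0, 2, 800)

inv_logit_hand(x)  # exp(800) overflows to Inf, so the last element is Inf/Inf = NaN
plogis(x)          # stays finite: the last element is 1

# For moderate inputs the two agree to floating-point precision
all.equal(inv_logit_hand(c(-2, 0, 2)), plogis(c(-2, 0, 2)))
```

For log-odds of ordinary size, as in the question, either form is fine; plogis only matters at the extremes.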

You haven't really shown us what you are doing "by hand". What x value are you using? How are you getting a single number from predict ?

@BenBolker I presumed the OP was using b as the intercept of the model (there are no fixed effect variables in the formula)

sorry, it's a mean of predicted values, corrected now. x in the formula you wrote is the intercept of the model, that's right

Answers

When you run predict on a glmer model, it uses the variables present in your original data (including the random effects) to estimate the probability for each row. So predict will not return a vector of identical values equal to the single value you get by running exp(b)/(1 + exp(b)) on the fixed effect coefficient.

To see this, let's try passing a little data frame of the random effect variables to the newdata argument of predict:

predict(Model1, newdata = data.frame(Item.number = 1, 
                                     Response.number = c(1, 2), 
                                     Ps.number = 1), type = 'response')  
#>         1         2 
#> 0.2261900 0.2405297
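The per-row predictions differ because each row's linear predictor is the fixed intercept plus the estimated random intercepts of the groups that row belongs to. Below is a minimal sketch of that reconstruction on a hypothetical toy model with a single grouping factor g (the data, names, and parameter values are invented for illustration, not the question's model):

```r
library(lme4)

set.seed(1)
# Hypothetical toy data: 30 groups, 50 binary observations each
d <- data.frame(g = factor(rep(1:30, each = 50)))
u <- rnorm(30, sd = 1)                          # simulated group intercepts
d$y <- rbinom(nrow(d), 1, plogis(-1 + u[d$g]))

fit_toy <- glmer(y ~ 1 + (1 | g), data = d, family = binomial)

# Linear predictor by hand: fixed intercept + each row's group intercept
b0  <- fixef(fit_toy)[["(Intercept)"]]
re  <- ranef(fit_toy)$g                         # one row per level of g
eta <- b0 + re[as.character(d$g), "(Intercept)"]

# Matches predict() with random effects included
all.equal(plogis(eta), unname(predict(fit_toy, type = "response")))

# With re.form = NA every row gets plogis(b0)
all.equal(plogis(b0),
          unname(predict(fit_toy, type = "response", re.form = NA)[1]))
```

The same logic should apply to the nested and crossed terms in Model1, where each row picks up one estimated intercept per random-effect term.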

Your model has no fixed effects other than the intercept, so the probability for a typical group (all random effects set to zero) is simply:

b <- fixef(Model1)
exp(b)/(1 + exp(b))
#> (Intercept) 
#>   0.2319048

As Ben Bolker points out in the comments, in a GLMM this back-transformed intercept is not the same as the raw proportion in the data (the emmeans vignette he links discusses the bias adjustment involved). He also points out that we can drop the random effects from predict using re.form = NA, which gives the same value as the back-transformed intercept:

mean(predict(Model1, type= 'response', re.form = NA)) == plogis(fixef(Model1))
#> (Intercept)
#>        TRUE

So it really depends on what you want to predict, i.e. whether you want the random effects taken into account or not. If you do, you can use predict; otherwise you can hand calculate from the fixed effects or use re.form = NA inside predict.

As a side note, the base R function plogis is probably the easiest way to convert log odds to probability, and it clearly works here - we can see that using type = "response" is equivalent to plogis(predict(Model1, type = "link"))

all(
  plogis(predict(Model1, type = "link")) == predict(Model1, type = "response")
)
#> [1] TRUE

Calculating it by hand is fine, though you will get very small floating point differences:

b <- predict(Model1, type = "link")

hist(exp(b)/(1 + exp(b)) - predict(Model1, type = 'response'))

So the smart way to hand calculate the overall probability from your model is

plogis(fixef(Model1))
#> (Intercept) 
#>   0.2319048

I think there's an important point that you might be missing when comparing the mean of the data with the mean of the predictions. It is the one behind @AllanCameron's comment: plogis(mean(predict(model))) is not the same as mean(plogis(predict(model))) (this is Jensen's inequality).
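The inequality can be seen without fitting any model. This is a minimal base-R sketch in which the log-odds are simply drawn from a normal distribution; the mean qlogis(0.7) and the standard deviation sqrt(3) are assumed values chosen for illustration:

```r
set.seed(42)
# Assumed for illustration: log-odds centred at qlogis(0.7), SD sqrt(3)
eta <- rnorm(1e5, mean = qlogis(0.7), sd = sqrt(3))

p_of_mean <- plogis(mean(eta))   # back-transform the average log-odds
mean_of_p <- mean(plogis(eta))   # average the back-transformed probabilities

c(p_of_mean = p_of_mean, mean_of_p = mean_of_p)
```

Because the average log-odds is above zero here, averaging on the probability scale pulls the result toward 0.5, so mean_of_p comes out below p_of_mean.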

library(lme4)
library(emmeans)
set.seed(123)
n <- 7052
Df <- data.frame(
  Response.number = sample(1:20, n, replace = TRUE),  
  Item.number = sample(1:40, n, replace = TRUE), 
  Ps.number = sample(1:40, n, replace = TRUE)  
)
Df$source <- simulate(~(1|Response.number/Item.number) +  (1|Ps.number),
   family = binomial,
   newdata = Df,
   newparams = list(beta = qlogis(0.7), theta = c(1, 1, 1)))[[1]]
fit <- glmer(source ~(1|Response.number/Item.number) +  (1|Ps.number),
   family = binomial,
   data = Df)

mean(Df$source)  ## 0.6498866
(p1 <- predict(fit, newdata = data.frame(dummy = 1), re.form = NA)) ## 0.9280772
plogis(p1)  ## 0.716685
(p2 <- predict(fit, newdata = data.frame(dummy = 1), 
    re.form = NA, type = "response")) ## 0.716685
emmeans(fit, ~ 1)
##  1       emmean    SE  df asymp.LCL asymp.UCL
##  overall  0.928 0.263 Inf     0.412      1.44
emmeans(fit, ~ 1, type = "response")
## 1        prob     SE  df asymp.LCL asymp.UCL
##  overall 0.717 0.0535 Inf     0.602     0.809

Following the emmeans vignette, a bias-adjusted estimate:

vars <- sapply(VarCorr(fit), c)  ## one variance per random-effect term
total.SD <- sqrt(sum(vars))      ## variances add; the SD is the square root of their sum
emmeans(fit, ~ 1, type = "response", bias.adj = TRUE,
  sigma = total.SD)
##  1        prob     SE  df asymp.LCL asymp.UCL
##  overall 0.614 0.0398 Inf     0.545     0.698

The bias correction isn't exact (it uses a delta method approximation) so that's not quite right, but it's closer.

This is a little better:

library(logitnorm)
momentsLogitnorm(mu = fixef(fit), sigma = total.SD)
##       mean        var 
## 0.65790176 0.06472473

Or:

mean(predict(fit, type = "response")) ## 0.6500409
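The logit-normal mean used above can also be approximated in base R by numerical integration, which makes explicit what is being averaged. This is a rough sketch: mu and sigma are assumed round values close to the fit above, not extracted from it.

```r
# Assumed values for illustration (not read from the fitted model)
mu    <- 0.928     # intercept on the logit scale
sigma <- sqrt(3)   # total random-effect SD if each variance is about 1

# E[plogis(X)] for X ~ Normal(mu, sigma^2)
marginal_mean <- integrate(
  function(x) plogis(x) * dnorm(x, mean = mu, sd = sigma),
  lower = -Inf, upper = Inf
)$value

c(conditional = plogis(mu), marginal = marginal_mean)
```

The marginal value lies between 0.5 and plogis(mu), which is the direction of the gap between the back-transformed intercept and the raw proportion in the data.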

Comments

This is a good answer, but there is a very important distinction to make. If we run a GLM (no random effect), the mean of the data will be the same as the back-transformed intercept: set.seed(101); x <- rbinom(100, size = 1, prob = 0.2); g <- glm(x ~ 1, family = binomial); all.equal(unname(plogis(coef(g))), mean(x)). However, this is not true for GLMMs; e.g. see cran.r-project.org/web/packages/emmeans/vignettes/…

Also, if you want random effects ignored, you can use predict(model, re.form = NA)

@BenBolker I think then that this is essentially the answer that the OP was looking for. mean(predict(Model1, type= 'response', re.form = NA)) is the same as plogis(fixef(Model1))

Thanks! The "re.form = NA" part is what makes the difference between R and my hand calculations - but I don't think I want random effects ignored, they are the only effects I have in the model. So is it ok to use the value from R even though it doesn't match hand calculation and attribute the difference to random effects?

@Agata what difference though? The mean of the probability output given by predict?
