Random forest regression - cumulative MSE?

Reposted by: 行者123. Last updated: 2023-11-30 09:16:19

I am new to random forests and I have a question about regression. I am using the R package randomForest to fit RF models.

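My ultimate goal is to select the set of variables that matter for predicting a continuous trait. So I fit a model, remove the variable with the lowest mean decrease in accuracy, fit a new model, and so on. This worked for RF classification, where I compared the models using the OOB error from the prediction (training set), development and validation data sets. Now, with regression, I want to compare the models based on % variance explained and MSE.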
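The elimination loop described above could be sketched roughly as follows. This is only an illustration under assumptions: it uses the iris data from the example further down, takes the permutation importance (`importance(rf, type = 1)`, reported as %IncMSE for regression) as the regression analogue of "mean decrease in accuracy", and the names `vars` and `history` are made up for this sketch.

```r
library(randomForest)

set.seed(2)
X <- iris[, 2:4]   # numeric predictors only, to keep the sketch simple
y <- iris[, 1]     # continuous response: Sepal.Length

vars <- colnames(X)
history <- data.frame()

# Stop once only two variables are left
while (length(vars) >= 2) {
  rf <- randomForest(x = X[, vars, drop = FALSE], y = y,
                     ntree = 200, importance = TRUE)

  # type = 1 selects the permutation-based measure (%IncMSE for regression)
  imp <- importance(rf, type = 1)[, 1]
  weakest <- names(which.min(imp))

  # Record the whole-forest OOB MSE and % variance explained for this model
  history <- rbind(history, data.frame(
    n_vars  = length(vars),
    oob_mse = rf$mse[rf$ntree],
    pct_var = 100 * rf$rsq[rf$ntree],
    dropped = weakest
  ))

  vars <- setdiff(vars, weakest)
}

history
```

Each row of `history` describes one fitted model, so successive models can be compared on OOB MSE and % variance explained.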

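I was evaluating the results for MSE and % var explained, and I get exactly the same values when I compute them by hand from the predictions in model$predicted. But when I look at model$mse, the value reported corresponds to the MSE of the last tree computed, and the same happens for % var explained.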

As an example, you can try this code in R:

library(randomForest)
data("iris")
head(iris)

TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1]  #creating training set - Y vector

TestingX<-iris[101:150,2:4]  #creating test set - X matrix
TestingY<-iris[101:150,1]  #creating test set - Y vector

set.seed(2)

model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
                    xtest = TestingX, ytest = TestingY)

#for prediction (training set)

pred<-model$predicted

meanY<-sum(TrainingY)/length(TrainingY)

varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)

mseY<-sum((TrainingY-pred)^2)/length(TrainingY)

r2<-(1-(mseY/varpY))*100

#for testing (test set)

pred_2<-model$test$predicted

meanY_2<-sum(TestingY)/length(TestingY)

varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)

mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)

r2_2<-(1-(mseY_2/varpY_2))*100

training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)

results<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(results)<-c("last tree", "by hand")
results
model

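After running this code you get a table with the MSE and % var explained (also called rsq) values. The first row, "last tree", contains the MSE and % var explained stored for the 500th tree of the forest. The second row, "by hand", contains the values computed in R from the vectors model$predicted and model$test$predicted.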

So, my questions are:

1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)

2- Is the last tree to be considered as an average of all the others?

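3- Why are the MSE and % var explained of the RF model (shown in the printed summary when you call model) the same as those of the 500th tree (see the first row of the table)? Do the vectors model$mse and model$rsq contain cumulative values?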

Edit: I found this post from Andy Liaw (one of the creators of the package), which says that MSE and % var explained are in fact cumulative: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html

Best answer

Not sure I understand what your issue is; I'll give it a try nevertheless...

1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)

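You are correct in your thinking; the trees are fitted independently of each other, hence their predictions are indeed independent. In fact, this is a crucial advantage of RF models, since it allows for parallel implementations.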
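To look at the individual trees directly, `predict()` on a randomForest object accepts `predict.all = TRUE`, which returns the per-tree predictions alongside the aggregated one. A minimal sketch, assuming the same iris split as in the question and a smaller forest of 25 trees (the object names `rf` and `p` are only for illustration):

```r
library(randomForest)

set.seed(2)
X     <- iris[1:100, 2:4]
y     <- iris[1:100, 1]
X_new <- iris[101:150, 2:4]

rf <- randomForest(x = X, y = y, ntree = 25)

# predict.all = TRUE returns a list with
#   $aggregate  : the forest prediction for each observation
#   $individual : a matrix with one column per tree
p <- predict(rf, newdata = X_new, predict.all = TRUE)

dim(p$individual)   # one row per new observation, one column per tree

# For regression, the forest prediction is the plain average over the trees
all.equal(unname(p$aggregate), unname(rowMeans(p$individual)))
```

No single tree plays a special role here: each column of `p$individual` comes from one tree, and the forest prediction only averages them.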

2- Is the last tree to be considered as an average of all the others?

No; as clarified above, all trees are independent.

3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?

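Given your code above, what you are asking starts getting really unclear; the MSE and r2 you say you need are exactly what you are already computing in mseY and r2: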

mseY
[1] 0.1232342

r2
[1] 81.90718

Unsurprisingly, these are exactly the same values reported by model:

model
# result:

Call:
 randomForest(x = TrainingX, y = TrainingY, ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 0.1232342
                    % Var explained: 81.91

So I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...

But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.

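~~Most certainly not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE of each individual tree;~~ (see UPDATE below) I have never seen any use for it in practice (similarly for model$rsq):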

length(model$mse)
[1] 500

length(model$rsq)
[1] 500

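UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest: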

Several ways:

Read ?randomForest, especially the `Value' section.

Look at str(myforest.rf).

Look at print.randomForest.

If the forest has 100 trees, then the mse and rsq are vectors with 100 elements each, the i-th element being the mse (or rsq) of the forest consisting of the first i trees. So the last element is the mse (or rsq) of the whole forest.
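The cumulative reading can be checked by hand. The sketch below assumes the same iris split as in the question, a forest of 100 trees, and `keep.forest = TRUE` so that `predict()` still works when a test set is supplied; `cum_test_mse` is a name introduced only for this illustration. If the quoted description holds, both comparisons should come out TRUE.

```r
library(randomForest)

set.seed(2)
X      <- iris[1:100, 2:4];   y      <- iris[1:100, 1]
X_test <- iris[101:150, 2:4]; y_test <- iris[101:150, 1]

# keep.forest = TRUE is set explicitly because the forest is not kept
# by default when xtest is supplied
rf <- randomForest(x = X, y = y, xtest = X_test, ytest = y_test,
                   ntree = 100, keep.forest = TRUE)

# Training side: the last element of rf$mse should match the OOB MSE
# computed from the aggregated OOB predictions in rf$predicted
oob_mse <- mean((y - rf$predicted)^2)
all.equal(rf$mse[rf$ntree], oob_mse)

# Test side: rebuild the "forest of the first i trees" from per-tree predictions
p <- predict(rf, newdata = X_test, predict.all = TRUE)$individual
cum_test_mse <- sapply(seq_len(rf$ntree), function(i) {
  mean((y_test - rowMeans(p[, 1:i, drop = FALSE]))^2)
})
all.equal(cum_test_mse, rf$test$mse)

# The error curve as trees are added; plot(rf) draws the same rf$mse vector
plot(rf$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")
```

Read this way, the "last tree" row in the question's table is the MSE of the whole forest, not of the 500th tree, which is why it matches both the hand calculation and the printed model summary.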

Regarding "Random forest regression - cumulative MSE?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55198048/
