r - Why do different random forest implementations in R produce different results?

I admit this is a somewhat difficult question to ask of anyone other than the people who wrote these packages, but I am getting consistently different results from three different implementations of random forests in R.

The three methods are the randomForest package, the "rf" method in caret, and the ranger package. The code is included below.

The data in question are just an example; I see similar behavior with other specifications on similar data.

LHS variable: party identification (Democrat, Republican, Independent). The right-hand-side predictors are demographics. In trying to figure out what was going on with some bizarre results in the randomForest package, I tried implementing the same model in the other two methods. I found that they do not reproduce that particular anomaly, which is especially strange because, as far as I know, the rf method in caret is just an indirect use of the randomForest package.

The three specifications I run in each implementation are (1) three-category classification, (2) dropping the Independent category, and (3) the same as (2) but with a single observation scrambled to "Independent" to preserve the three-category model, which should produce results similar to (2). As far as I can tell, in no case should there be any over- or under-sampling that would explain the results.

I have also noticed the following trends:

  1. The randomForest package is the only one that gets completely confused with only two categories.
  2. The ranger package consistently identifies more observations as Independents (whether correctly or incorrectly).
  3. The ranger package is always slightly worse in terms of overall predictive accuracy.
  4. The caret package's overall accuracy is similar to (slightly higher than) randomForest's, but it consistently does better on the more common class and worse on the less common class. This is strange because, as far as I know, I did not implement any over- or under-sampling in either case, and because I thought caret relied on the randomForest package (a quick way to verify this is sketched just after this list).
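
For instance (a sketch, assuming caret is loaded and rf_caret_3 has been fit as in the code further below), caret's "rf" method declares randomForest as its backing library, and the fitted finalModel is an ordinary randomForest object:

library(caret)
getModelInfo("rf", regex = FALSE)[[1]]$library  # "randomForest"
class(rf_caret_3$finalModel)                    # "randomForest.formula" "randomForest"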

Below I include the code and the confusion matrices showing the relevant differences. Re-running the code produces similar trends in the confusion matrices each time; this is not a case of "any individual run might produce odd results."

Does anyone know why these packages consistently produce slightly different (and, in the case of the linked randomForest problem, very different) results, or, even better, why they differ in these particular ways? For example, is there some sort of sample weighting/stratification in these packages that I should be aware of?
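
For reference, by default neither randomForest nor ranger applies class weights or stratified sampling, and the code below leaves everything at its defaults; both packages do expose such controls. A sketch of those knobs (the weights and sample sizes here are purely illustrative, not from the original post):

library(randomForest)
library(ranger)

# randomForest: class priors, stratified bootstrap, per-stratum sample sizes
rf_weighted <- randomForest(party_id_3_cat ~ ., data = three_cat,
                            classwt = c(1, 2, 1),          # illustrative class priors
                            strata = three_cat$party_id_3_cat,
                            sampsize = c(500, 500, 500))   # illustrative per-stratum draws

# ranger: per-class weights on the splitting criterion
rg_weighted <- ranger(party_id_3_cat ~ ., data = three_cat,
                      class.weights = c(1, 2, 1))          # illustrative class weights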

Code:

library(randomForest)
library(ranger)
library(caret)

num_trees <- 1001
var_split <- 3

load("three_cat.Rda")
rf_three_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                             data = three_cat,
                             ntree = num_trees,
                             mtry = var_split,
                             type = "classification",
                             importance = TRUE, confusion = TRUE)

# Two-category specification: drop the Independents
two_cat <- subset(three_cat, party_id_3_cat != "2. Independents")
two_cat$party_id_3_cat <- droplevels(two_cat$party_id_3_cat)
rf_two_cat <- randomForest(party_id_3_cat ~ {RHS Vars},
                           data = two_cat,
                           ntree = num_trees,
                           mtry = var_split,
                           type = "classification",
                           importance = TRUE, confusion = TRUE)

# Scrambled specification: two categories, plus a single observation
# flipped to Independent so the three-level factor is preserved
scramble_independent <- subset(three_cat, party_id_3_cat != "2. Independents")
scramble_independent[1, 19] <- "2. Independents"
scramble_independent <- data.frame(lapply(scramble_independent, as.factor),
                                   stringsAsFactors = TRUE)
rf_scramble <- randomForest(party_id_3_cat ~ {RHS Vars},
                            data = scramble_independent,
                            ntree = num_trees,
                            mtry = var_split,
                            type = "classification",
                            importance = TRUE, confusion = TRUE)

ranger_2 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = two_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_3 <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                   data = three_cat,
                   num.trees = num_trees, mtry = var_split)
ranger_scram <- ranger(formula = party_id_3_cat ~ {RHS Vars},
                       data = scramble_independent,
                       num.trees = num_trees, mtry = var_split)

# method = "none": no resampling, fit a single model at the fixed tuning grid
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = c(3))
rf_caret_3 <- train(party_id_3_cat ~ {RHS Vars},
                    data = three_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2 <- train(party_id_3_cat ~ {RHS Vars},
                    data = two_cat,
                    method = "rf", ntree = num_trees,
                    type = "classification",
                    importance = TRUE, confusion = TRUE,
                    trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat ~ {RHS Vars},
                           data = scramble_independent,
                           method = "rf", ntree = num_trees,
                           type = "classification",
                           importance = TRUE, confusion = TRUE,
                           trControl = rfControl, tuneGrid = rfGrid)

# Three-category confusion matrices
rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

# Two-category confusion matrices
rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

# Scrambled (two categories plus one Independent) confusion matrices
rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]
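
To put a single number on the "overall accuracy" comparisons, here is a small helper (my addition, not part of the original post) that works for all three objects by dropping the class.error column when present:

overall_accuracy <- function(cm) {
  cm <- as.matrix(cm)
  cm <- cm[, setdiff(colnames(cm), "class.error"), drop = FALSE]
  sum(diag(cm)) / sum(cm)  # correct predictions / total observations
}
overall_accuracy(rf_three_cat$confusion)
overall_accuracy(ranger_3$confusion.matrix)
overall_accuracy(rf_caret_3$finalModel$confusion)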

Results (formatting slightly modified for ease of comparison; rows are true classes, and the predicted-class columns are abbreviated to Dem/Ind/Rep):

> rf_three_cat$confusion
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1121    3   697    0.3844042
2. Independents                       263    7   261    0.9868173
3. Republicans (including leaners)    509    9  1096    0.3209418

> ranger_3$confusion.matrix
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1128   46   647    0.3805601
2. Independents                       263   23   245    0.9566855
3. Republicans (including leaners)    572   31  1011    0.3736059

> rf_caret_3$finalModel["confusion"]
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1268    0   553    0.3036793
2. Independents                       304    0   227    1.0000000
3. Republicans (including leaners)    606    0  1008    0.3754647

> rf_two_cat$confusion
                                      Dem   Rep  class.error
1. Democrats (including leaners)     1775    46    0.0252608
3. Republicans (including leaners)   1581    33    0.9795539

> ranger_2$confusion.matrix
                                      Dem   Rep  class.error
1. Democrats (including leaners)     1154   667    0.3662823
3. Republicans (including leaners)    590  1024    0.3655514

> rf_caret_2$finalModel["confusion"]
                                      Dem   Rep  class.error
1. Democrats (including leaners)     1315   506    0.2778693
3. Republicans (including leaners)    666   948    0.4126394

> rf_scramble$confusion
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1104    0   717    0.3937397
2. Independents                         0    0     1    1.0000000
3. Republicans (including leaners)    501    0  1112    0.3106014

> ranger_scram$confusion.matrix
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1159    0   662    0.3635365
2. Independents                         0    0     1    1.0000000
3. Republicans (including leaners)    577    0  1036    0.3577185

> rf_caret_scramble$finalModel["confusion"]
                                      Dem  Ind   Rep  class.error
1. Democrats (including leaners)     1315    0   506    0.2778693
2. Independents                         0    0     1    1.0000000
3. Republicans (including leaners)    666    0   947    0.4128952

Best Answer

First of all, the random forest algorithm is... random, so some variation between runs is expected by default. Second, and more importantly, the algorithms differ, i.e. they use different steps, which is why you get different results.
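
To illustrate the first point (my example on a built-in dataset, not the poster's data): fixing the seed makes repeated runs of the same implementation reproducible, but it does nothing to make different implementations agree with each other.

library(randomForest)
set.seed(1); fit_a <- randomForest(Species ~ ., data = iris, ntree = 501)
set.seed(1); fit_b <- randomForest(Species ~ ., data = iris, ntree = 501)
identical(fit_a$predicted, fit_b$predicted)  # TRUE: same package, same seed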

You should look at how they perform the splits (which criterion: Gini, extremely randomized, etc.) and whether the splits themselves are randomized (extremely randomized trees); how they draw the bootstrap samples (with or without replacement) and in what proportion; mtry, i.e. how many variables are tried at each split; and the maximum depth or the maximum number of cases in a node, and so on.
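
As a concrete starting point, here is a sketch of pinning ranger's options to randomForest's classification defaults (it assumes the same three_cat data and uses all remaining columns as predictors in place of the {RHS Vars} placeholder). One documented difference worth singling out: ranger's respect.unordered.factors defaults to "ignore" (factor levels are treated as if ordered), whereas randomForest searches over partitions of the factor levels, which by itself can change the splits chosen on demographic factors.

library(randomForest)
library(ranger)

rf_fit <- randomForest(party_id_3_cat ~ ., data = three_cat,
                       ntree = 1001, mtry = 3,
                       replace = TRUE,   # bootstrap with replacement (the default)
                       nodesize = 1)     # randomForest's classification default

rg_fit <- ranger(party_id_3_cat ~ ., data = three_cat,
                 num.trees = 1001, mtry = 3,
                 replace = TRUE,                            # match the bootstrap
                 sample.fraction = 1,                       # draw n of n observations
                 min.node.size = 1,                         # match nodesize = 1
                 splitrule = "gini",                        # both default to Gini
                 respect.unordered.factors = "partition")   # mimic randomForest's factor handling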

Regarding "r - Why do different random forest implementations in R produce different results?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/52263749/
