
r - Searching for the best random forest parameters with a for loop in R

Reposted. Author: 行者123  Updated: 2023-11-30 08:43:45

Hi everyone, I am trying to search for the best parameters with a for loop. However, the results really confuse me. The two code blocks below should give the same result, since the parameter "mtry" is identical.

       gender Partner   tenure Churn
3521     Male      No 0.992313   Yes
2525.1   Male      No 4.276666    No
567      Male     Yes 2.708050    No
8381   Female      No 4.202127   Yes
6258   Female      No 0.000000   Yes
6569     Male     Yes 2.079442    No
27410  Female      No 1.550804   Yes
6429   Female      No 1.791759   Yes
412    Female     Yes 3.828641    No
4655   Female     Yes 3.737670    No
---
RFModel = randomForest(Churn ~ .,
                       data = ggg,
                       ntree = 30,
                       mtry = 2,
                       importance = TRUE,
                       replace = FALSE)
print(RFModel$confusion)

    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
---
for(i in c(2)){
  RFModel = randomForest(Churn ~ .,
                         data = Trainingds,
                         ntree = 30,
                         mtry = i,
                         importance = TRUE,
                         replace = FALSE)
  print(RFModel$confusion)
}

    No Yes class.error
No   3   2         0.4
Yes  2   3         0.4
---
  1. Code 1 and Code 2 should give the same output.

Best Answer

You get slightly different results each time because randomness is built into the algorithm. To build each tree, the algorithm resamples the data frame and randomly selects mtry columns from the resampled data frame to build that tree. If you want models built with the same parameters (e.g., mtry, ntree) to give identical results every time, you need to set a random seed.
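The effect of the seed can be seen with base R's random number generator alone, before involving randomForest at all (a minimal illustration):

```r
# Same seed, same draws: resetting the seed replays the random stream
set.seed(5)
a <- sample(100, 5)
set.seed(5)
b <- sample(100, 5)
identical(a, b)   # TRUE
```

The same mechanism drives the resampling inside randomForest, which is why setting a seed makes its fits reproducible.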

For example, let's run randomForest 10 times and check the mean of the mean squared error for each run. Note that the mean MSE differs every time:

library(randomForest)

replicate(10, mean(randomForest(mpg ~ ., data=mtcars)$mse))
[1] 5.998530 6.307782 5.791657 6.125588 5.868717 5.845616 5.427208 6.112762 5.777624 6.150021

If you run the code above, you will get another 10 values that differ from the ones above.

If you want to be able to reproduce the results of a given model run with the same parameters (e.g., mtry and ntree), you can set a random seed. For example:

set.seed(5)
mean(randomForest(mpg ~ ., data=mtcars)$mse)
[1] 6.017737

You will get the same result as long as you use the same seed value, and different results otherwise. Using a larger ntree value will reduce, but not eliminate, the variability between model runs.
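Applying this to the original parameter search: one option is to reset the seed before each fit inside the loop, so every candidate mtry is evaluated under identical randomness. This is only a sketch, assuming the randomForest package; it uses the built-in iris data as a self-contained stand-in for the questioner's Trainingds, and oob_error is a hypothetical name:

```r
library(randomForest)

# Sketch: compare several mtry values under identical randomness by
# resetting the seed before each fit (iris has 4 predictors, so mtry 1:4).
oob_error <- sapply(1:4, function(m) {
  set.seed(1)                                # same randomness for every mtry
  fit <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = m)
  fit$err.rate[nrow(fit$err.rate), "OOB"]    # OOB error after the last tree
})
names(oob_error) <- paste0("mtry=", 1:4)
oob_error
which.min(oob_error)   # candidate with the lowest OOB error
```

With the seed fixed, re-running the loop reproduces the same OOB errors, so differences between iterations reflect mtry rather than sampling noise.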

Update: When I run the code with the data sample you provided, I do not always get the same results each time either. Even with replace=FALSE (which samples the data frame without replacement), the columns selected to build each tree can differ from run to run:

> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 20%
Confusion matrix:
    No Yes class.error
No   4   1         0.2
Yes  1   4         0.2
> randomForest(Churn ~ .,
+ data = ggg,
+ ntree = 30,
+ mtry = 2,
+ importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Churn ~ ., data = ggg, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 30%
Confusion matrix:
    No Yes class.error
No   3   2         0.4
Yes  1   4         0.2

Here is a similar set of results using the built-in iris data frame:

> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          2        48        0.04
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
> randomForest(Species ~ ., data=iris, ntree=30, mtry=2, importance = TRUE,
+ replace = FALSE)

Call:
randomForest(formula = Species ~ ., data = iris, ntree = 30, mtry = 2, importance = TRUE, replace = FALSE)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2

OOB estimate of error rate: 6%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          6        44        0.12

You can also look at the trees generated by each model run; they will usually differ. For example, suppose I run the following code three times, storing the results in the objects m1, m2, and m3.

randomForest(Churn ~ .,
             data = ggg,
             ntree = 30,
             mtry = 2,
             importance = TRUE,
             replace = FALSE)

Now let's look at the first four trees from each model object, pasted below. The output is a list. You can see that the first tree differs across all three model runs. The second tree is the same for the first two runs but different for the third, and so on.

check.trees = lapply(1:4, function(i) {
  lapply(list(m1=m1, m2=m2, m3=m3), function(model) getTree(model, i, labelVar=TRUE))
})

check.trees
[[1]]
[[1]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner    1.000000      1       <NA>
2             4              5    gender    1.000000      1       <NA>
3             0              0      <NA>    0.000000     -1         No
4             0              0      <NA>    0.000000     -1        Yes
5             6              7    tenure    2.634489      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[1]]$m2
  left daughter right daughter split var split point status prediction
1             2              3    gender    1.000000      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    1.850182      1       <NA>
4             0              0      <NA>    0.000000     -1        Yes
5             0              0      <NA>    0.000000     -1         No

[[1]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.249904      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[2]]
[[2]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[2]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1         No


[[3]]
[[3]]$m1
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             4              5    gender           1      1       <NA>
3             0              0      <NA>           0     -1         No
4             0              0      <NA>           0     -1        Yes
5             0              0      <NA>           0     -1        Yes

[[3]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[3]]$m3
  left daughter right daughter split var split point status prediction
1             2              3    tenure    2.129427      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             0              0      <NA>    0.000000     -1         No


[[4]]
[[4]]$m1
  left daughter right daughter split var split point status prediction
1             2              3    tenure    1.535877      1       <NA>
2             0              0      <NA>    0.000000     -1        Yes
3             4              5    tenure    4.015384      1       <NA>
4             0              0      <NA>    0.000000     -1         No
5             6              7    tenure    4.239396      1       <NA>
6             0              0      <NA>    0.000000     -1        Yes
7             0              0      <NA>    0.000000     -1         No

[[4]]$m2
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No

[[4]]$m3
  left daughter right daughter split var split point status prediction
1             2              3   Partner           1      1       <NA>
2             0              0      <NA>           0     -1        Yes
3             0              0      <NA>           0     -1         No
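Tree-by-tree comparisons like the one above can also be done programmatically with identical(). A sketch, again assuming the randomForest package and using the built-in iris data for self-containment:

```r
library(randomForest)

# Two fits with the same seed: every tree should match exactly
set.seed(1)
m1 <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2)
set.seed(1)
m2 <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2)

same_tree <- sapply(1:30, function(i) identical(getTree(m1, i), getTree(m2, i)))
all(same_tree)   # TRUE

# A third fit without resetting the seed will generally produce different trees
m3 <- randomForest(Species ~ ., data = iris, ntree = 30, mtry = 2)
identical(getTree(m1, 1), getTree(m3, 1))
```

This confirms the point of the answer: the trees, and hence the confusion matrices, only agree across runs when the seed is fixed.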

Regarding "r - Searching for the best random forest parameters with a for loop in R", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42601339/
