
r - Stratified sampling with Random Forests in R

Reposted. Author: 行者123. Updated: 2023-12-02 05:29:16

I am reading the following in the documentation of randomForest:

strata: A (factor) variable that is used for stratified sampling.

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

For reference, the interface to the function is:

 randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
mtry=if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
replace=TRUE, classwt=NULL, cutoff, strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL,
importance=FALSE, localImp=FALSE, nPerm=1,
proximity, oob.prox=proximity,
norm.votes=TRUE, do.trace=FALSE,
keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
keep.inbag=FALSE, ...)

My question is: how exactly does one use strata and sampsize? Here is a minimal working example in which I would like to test these parameters:

library(randomForest)
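# stringsAsFactors = TRUE keeps iris.type a factor (since R 4.0, read.table reads strings as character by default)
iris = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep = ",", header = FALSE, stringsAsFactors = TRUE)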
names(iris) = c("sepal.length", "sepal.width", "petal.length", "petal.width", "iris.type")

model = randomForest(iris.type ~ sepal.length + sepal.width, data = iris)

> model
500 samples
6 predictors
2 classes: 'Y0', 'Y1'

No pre-processing
Resampling: Bootstrap (7 reps)

Summary of sample sizes: 477, 477, 477, 477, 477, 477, ...

Resampling results across tuning parameters:

  mtry  ROC    Sens  Spec  ROC SD  Sens SD  Spec SD
  2     0.763  1     0     0.156   0        0
  4     0.782  1     0     0.231   0        0
  6     0.847  1     0     0.173   0        0

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.

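I want to use these parameters because I would like the RF to draw bootstrap samples that respect the proportion of positives to negatives in my data.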

This other thread started a discussion on the topic, but it was settled without clarifying how to use these parameters.

Best answer

Wouldn't it just be something like this:

model = randomForest(iris.type ~ sepal.length + sepal.width, 
data = iris,
sampsize=c(10,10,10), strata=iris$iris.type)

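I did try ..., strata=iris.type and ..., strata='iris.type', but apparently the code is not written to evaluate that value in the environment of the "data" argument, so the factor has to be passed explicitly as iris$iris.type. I used the outcome variable because it is the only factor variable in this dataset, but I do not think strata needs to be the outcome variable. In fact, I think it most definitely should not be the outcome variable. This particular model would be expected to produce useless output and is presented only to test the syntax.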
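One way to check that the per-stratum sizes are being honored is to keep the in-bag record and count it by class. The following is only a minimal sketch under a few assumptions that are not in the question: it uses R's built-in iris data (so Species stands in for iris.type), samples without replacement so each in-bag entry is 0 or 1, and uses a small ntree to keep it quick.

```r
library(randomForest)

# Built-in iris data, so the sketch does not depend on a download.
# Species plays the role of iris.type from the question.
data(iris)
set.seed(42)

model <- randomForest(
  Species ~ Sepal.Length + Sepal.Width,
  data       = iris,
  strata     = iris$Species,   # passed as a factor vector, not a column name
  sampsize   = c(10, 10, 10),  # one entry per stratum level
  replace    = FALSE,          # without replacement: in-bag entries are 0/1
  keep.inbag = TRUE,           # keep the n-by-ntree in-bag matrix
  ntree      = 50
)

# Number of in-bag rows per class for the first tree.
inbag_first_tree <- tapply(model$inbag[, 1], iris$Species, sum)
print(inbag_first_tree)
```

If the stratified draw behaves as the documentation describes, each class should contribute 10 rows to every tree. To respect the class proportions in the data instead of using equal counts, sampsize could be derived from table(iris$Species), rounded to whole numbers.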

Regarding r - Stratified sampling with Random Forests in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14842059/
