
r - Stratified sampling doesn't seem to change randomForest results

Reposted. Author: 行者123. Updated: 2023-12-01 10:51:06

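I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 = absence, 1 = presence) and highly imbalanced: for some species the ratio of absences to presences is 37:1. This imbalance (or zero inflation) leads to questionable out-of-bag (OOB) error estimates: the larger the ratio of absences to presences, the lower my OOB error estimate.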

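To compensate for this imbalance, I wanted to use stratified sampling so that every tree in the random forest is grown on an equal (or at least less imbalanced) number of observations from the presence and absence classes. I was surprised to find that the OOB error estimates of the stratified and unstratified models do not seem to differ at all. See the code below: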

Without stratification

> set.seed(25)
> HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
> HHrf
Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702
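
As a quick check on the printed output, the overall OOB error rate and the per-class errors can be recomputed by hand from the confusion matrix. The sketch below uses only base R, and the object names (conf, always_absent_error) are made up for illustration. It also computes the error of a hypothetical classifier that always predicts absence, which shows how little the overall error rate says about the presence class.

```r
# Confusion matrix copied from the printed output
# (rows = observed class, columns = predicted class)
conf <- matrix(c(422, 18,
                  84, 10),
               nrow = 2, byrow = TRUE,
               dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

# Overall OOB error: off-diagonal counts divided by the total
oob_error <- (sum(conf) - sum(diag(conf))) / sum(conf)

# Per-class error: share of misclassified rows within each observed class
class_error <- 1 - diag(conf) / rowSums(conf)

# Hypothetical baseline that always predicts "0" (absence):
# it is wrong exactly on the observed presences
always_absent_error <- unname(rowSums(conf)["1"] / sum(conf))

round(100 * oob_error, 1)            # 19.1
round(class_error, 4)
round(100 * always_absent_error, 1)  # 17.6
```

With these counts, the always-absent baseline would score about 17.6%, which is lower than the forest's 19.1%, even though the baseline never detects a presence.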

With stratification
> HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, strata = bll_HH$HH_Pres, sampsize = ceiling(.632*nrow(bll_HH)))
> HHrf

Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702

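Is there a reason why I get the same results in both cases? For the strata argument I specified my response variable, HH_Pres. For the sampsize argument I specified that it should simply be 63.2% of the whole data set.

Does anyone know what I am doing wrong? Or is this to be expected?

Thanks,

Lisa

To reproduce this issue:

Sample data: https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit

Code: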
library(randomForest)
bll = read.csv("bll_Nov2013_NMV.csv", header=TRUE)
HH_Pres <- bll$HammerHeadALL_Presence

Slope <-bll$Slope
Dist2Shr <-bll$Dist2Shr
Bathy <-bll$Bathy2
Chla <-bll$GSM_Chl_Daily_MF
SST <-bll$SST_PF_daily
Region <- bll$Region
MoonPhase <-bll$MoonPhase
DaylightHours <- bll$DaylightHours
bll_HH <- data.frame(HH_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region)
set.seed(25)

HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
HHrf
set.seed(25)
HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, strata = bll_HH$HH_Pres, sampsize = c(100, 50), ntree = 500, replace = FALSE, importance = TRUE)
HHrf

Best answer

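As far as I know, the sampsize argument should be a vector with the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be given a vector with the same length as the number of factor levels in strata. I'm not sure it behaves the way you describe in your question, but it has been a while since I last used the randomForest function.

From the help file: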

strata

A (factor) variable that is used for stratified sampling.

sampsize:

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.



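For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations to sample from each class during training.

e.g. sampsize=c(100,50)
Additionally, you can name the groups to make this clearer.

e.g. sampsize=c('0'=100, '1'=50)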
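
Such a named vector can also be built from the data instead of typing the numbers by hand. The following is a sketch under assumptions: the class counts (440 absences, 94 presences) are read off the confusion matrix in the question, and drawing 63.2% of the smaller class from both classes is just one possible choice for a balanced design, not something the randomForest documentation prescribes.

```r
# Stand-in response with the class counts from the question's confusion matrix
HH_Pres <- factor(rep(c("0", "1"), times = c(440, 94)))

class_counts <- table(HH_Pres)

# Draw the same number from each class: 63.2% of the smaller class,
# so that some minority-class rows stay out-of-bag when replace = FALSE
n_per_class <- ceiling(0.632 * min(class_counts))

# One entry per factor level, named after the levels
balanced_sampsize <- setNames(rep(n_per_class, nlevels(HH_Pres)), levels(HH_Pres))

balanced_sampsize
```

The result could then be passed as sampsize = balanced_sampsize together with strata = HH_Pres.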
An example from the help file that uses the sampsize argument, to clarify:
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))

EDIT: Added some notes about the strata argument in randomForest.

EDIT: Make sure the strata argument is given a factor variable!

e.g. try strata = factor(HH_Pres), sampsize = c(...), where c(...) is a vector with the same length as length(levels(factor(bll_HH$HH_Pres)))
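
A small base-R sketch of that check, using a made-up numeric 0/1 vector in place of the real response column: it confirms that a plain numeric column is not a factor, converts it, and verifies that a candidate sampsize vector (the values here are hypothetical) has one entry per level.

```r
# Made-up stand-in for the 0/1 response column
HH_Pres <- c(0, 0, 1, 0, 1, 0, 0)

is.factor(HH_Pres)            # FALSE: a numeric vector, not a factor

strata_var <- factor(HH_Pres)
levels(strata_var)            # "0" "1"

# Hypothetical per-class sample sizes, one entry per factor level
sampsize <- c("0" = 3, "1" = 1)

# Stops with an error if strata is not a factor or the lengths disagree
stopifnot(is.factor(strata_var),
          length(sampsize) == nlevels(strata_var))
```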

EDIT:

OK, I tried running the code with your data and it works for me.
# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)

# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit)
HHrf

# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 425 15  0.03409091
# 1  86  8  0.91489362

# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 424 16  0.03636364
# 1  85  9  0.90425532
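
To see what the proportional sampsize computed above looks like in numbers, here is a sketch that uses the class counts from the printed confusion matrices (440 rows of class 0, 94 of class 1) instead of the real data frame. It suggests that taking 63.2% of each class leaves the class ratio seen by each tree almost unchanged.

```r
# Class counts read off the confusion matrices; assumed to match
# table(bll_HH$HH_Pres) once rows with missing values are dropped
class_counts <- c("0" = 440, "1" = 94)

mySampSize <- ceiling(class_counts * 0.632)
mySampSize                                   # 279 and 60

# Ratio of absences to presences: in the data vs. in each tree's sample
ratio_data   <- unname(class_counts["0"] / class_counts["1"])
ratio_sample <- unname(mySampSize["0"] / mySampSize["1"])
round(c(data = ratio_data, sample = ratio_sample), 2)   # 4.68 and 4.65
```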

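Note that in this case the OOB error estimate is the same, even though we only use 63.2% of the data from each class in our bootstrap samples. This is probably because the sample sizes used are proportional to the class distribution in the training data, and because the data set is relatively small. Let's try changing mySampSize to make sure it really works.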
# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE,
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 21.16%
# Confusion matrix:
#     0  1 class.error
# 0 382 58   0.1318182
# 1  55 39   0.5851064
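
The effect of the 100/50 sampling can be read off the last two confusion matrices in this answer. The sketch below recomputes the numbers by hand; the helper name summarise_conf is hypothetical and not part of randomForest.

```r
# Summarise a 2x2 confusion matrix (rows = observed, columns = predicted)
summarise_conf <- function(conf) {
  c(overall_error  = (sum(conf) - sum(diag(conf))) / sum(conf),
    absence_error  = conf[1, 2] / sum(conf[1, ]),
    presence_error = conf[2, 1] / sum(conf[2, ]))
}

# Counts copied from the printed outputs
proportional <- matrix(c(424, 16, 85, 9), nrow = 2, byrow = TRUE)   # 63.2% of each class
fixed_100_50 <- matrix(c(382, 58, 55, 39), nrow = 2, byrow = TRUE)  # sampsize = c(100, 50)

round(rbind(proportional = summarise_conf(proportional),
            fixed_100_50 = summarise_conf(fixed_100_50)), 4)
```

With these counts the overall OOB error rises from about 18.9% to about 21.2%, while the error on the presence class falls from about 90% to about 59%.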

Regarding "r - Stratified sampling doesn't seem to change randomForest results", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20150525/
