
r - cforest varimp does not seem to work with categorical predictors

Reposted · Author: 行者123 · Updated: 2023-12-01 14:46:04

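I am trying to run a random forest model with the party package. I want to use the varimp function to compute conditional variable importance, but it does not seem to accept categorical variables. Here is a link to my data, and below is the code I am using.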

> #set up dataframe
> bll = read.csv("bll_Nov2013.csv", header=TRUE)
> SB_Pres <- bll$Sandbar_Presence #binary presence/absence
> Slope <-bll$Slope
> Dist2Shr <-bll$Dist2Shr
> Bathy <-bll$Bathy2
> Chla <-bll$GSM_Chl_Daily_MF
> SST <-bll$SST_PF_daily
> Region <- bll$Region
> MoonPhase <-bll$MoonPhase
> DaylightHours <- bll$DaylightHours
> bll_SB <- na.omit(data.frame(SB_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region))

> #run cforest model
> SBcf<- cforest(formula = factor(SB_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_SB, control = cforest_unbiased())
> SBcf

Random Forest using Conditional Inference Trees

Number of trees: 500

Response: factor(SB_Pres)
Inputs: SST, Chla, Dist2Shr, DaylightHours, Bathy, Slope, MoonPhase, factor(Region)
Number of observations: 534

> #Varimp works if conditional = FALSE
> varimp(SBcf, conditional = FALSE)
           SST           Chla       Dist2Shr  DaylightHours          Bathy          Slope
   0.024744898    0.084244898    0.015632653    0.009571429    0.006448980    0.003357143
     MoonPhase factor(Region)
   0.002724490    0.095000000


> #Varimp does NOT work if conditional = TRUE
> varimp(SBcf, conditional = TRUE)
Error in model.frame.default(formula = ~SST + Chla + Dist2Shr + DaylightHours + :
variable lengths differ (found for 'factor(Region)')

If I remove the factor(Region) variable, the conditional variable importance can be computed.
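The message itself comes from base R's model.frame, which raises it when the terms of a formula evaluate to vectors of different lengths. The sketch below is unrelated to party and uses made-up objects (`d`, `x`) to reproduce the same class of error. Whether this is exactly what happens inside varimp for the factor(Region) term is an assumption and has not been verified here.

```r
# Toy data frame with 5 rows; `x` is deliberately NOT a column of `d`.
d <- data.frame(y = c(1, 2, 3, 4, 5))
x <- c(10, 20, 30)   # length 3, looked up outside `d`

# model.frame() takes `y` from `d` and `x` from the formula's environment,
# so the two terms end up with different lengths.
msg <- tryCatch(
  {
    model.frame(y ~ x, data = d)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

In an English locale the captured message should mention that variable lengths differ, which suggests checking whether every term in the formula is resolved against the same data.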

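Is this a known behavior of the party package's varimp function with categorical predictors? From what I have read, it should be able to handle categorical predictors (Conditional variable importance for random forests - Strobl et al), although that paper does not explicitly state that varimp(obj, conditional = TRUE) can be used with categorical predictors.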

Any insight would be greatly appreciated!

Thanks,

Lisa

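Edit: To clarify, if you define the variable with as.factor outside the formula, as.factor does not actually take effect: the results are the same whether or not Region is specified as a factor. Compare these results with the other varimp (conditional = FALSE) run above, where the output shows the variable as "factor(Region)", whereas below it appears only as "Region" in both runs.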

> library("party")
> packageDescription("party")$Version
[1] "1.0-10"
> bll = read.csv("bll_SB.csv", header=TRUE)
> bll_SB <- na.omit(data.frame(bll))

> # region is specified as a factor
> bll_SB$SB_Pres <- factor(bll_SB$SB_Pres)
> bll_SB$Region <- factor(bll_SB$Region)
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB, control=cforest_unbiased())
> SBcf


Random Forest using Conditional Inference Trees

Number of trees: 500

Response: SB_Pres
Inputs: Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region
Number of observations: 534

> system.time(res1 <- varimp(SBcf, conditional = FALSE))
   user  system elapsed
  4.466   0.013   4.480
> res1
        Slope      Dist2Shr         Bathy          Chla           SST DaylightHours
  0.003632653   0.015908163   0.008285714   0.085367347   0.028846939   0.009520408
    MoonPhase        Region
  0.002969388   0.093061224


> # Run again, region is not specified as a factor
> bll_SB$Region <- bll_SB$Region
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB, control=cforest_unbiased())
> system.time(res2 <- varimp(SBcf, conditional = FALSE))
   user  system elapsed
  4.562   0.015   4.578
> res2
        Slope      Dist2Shr         Bathy          Chla           SST DaylightHours
  0.003632653   0.015908163   0.008285714   0.085367347   0.028846939   0.009520408
    MoonPhase        Region
  0.002969388   0.093061224
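One detail in the second run is worth checking: `bll_SB$Region <- bll_SB$Region` assigns the column to itself, so a column already converted with factor() stays a factor. That may be why both runs report identical importances. The base-R sketch below uses a made-up toy column (`df$Region` is a stand-in, not the poster's data) to show how to confirm what class a column really has.

```r
# Hypothetical toy column standing in for bll_SB$Region.
df <- data.frame(Region = c("A", "B", "A", "C"), stringsAsFactors = FALSE)
before <- class(df$Region)

# Explicit conversion to a factor.
df$Region <- factor(df$Region)
after_factor <- class(df$Region)

# Assigning a column to itself does not undo the conversion.
df$Region <- df$Region
still_factor <- is.factor(df$Region)

# To really get a non-factor column back, convert explicitly.
df$Region_chr <- as.character(df$Region)
chr_is_factor <- is.factor(df$Region_chr)

print(c(before = before, after_factor = after_factor))
print(c(still_factor = still_factor, chr_is_factor = chr_is_factor))
```

Calling `str(bll_SB)` before fitting would give the same information for every column at once. Note also that read.csv in R versions before 4.0 converts character columns to factors by default, so a text-valued Region may already be a factor straight after import.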

Best answer

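I cannot reproduce the problem with your example. I was able to compute the conditional variable importance for your dataset with the following code: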

R> library("party")
R> packageDescription("party")$Version
[1] "1.0-10"

R> bll = read.csv("bll_SB.csv", header=TRUE)
R>
R> bll_SB <- na.omit(data.frame(bll))
R> bll_SB$SB_Pres <- factor(bll_SB$SB_Pres)
R> bll_SB$Region <- factor(bll_SB$Region)
R>
R> set.seed(1)
R> SBcf <- cforest(SB_Pres ~ ., data=bll_SB, control=cforest_unbiased())
R> SBcf
#
# Random Forest using Conditional Inference Trees
#
# Number of trees: 500
#
# Response: SB_Pres
# Inputs: Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region
# Number of observations: 534

R> system.time(res1 <- varimp(SBcf, conditional = FALSE))
#    user  system elapsed
#   5.971   0.012   5.994
R> system.time(res2 <- varimp(SBcf, conditional = TRUE))
#    user  system elapsed
#  2704.1    58.2  2768.0
R> res1
#         Slope      Dist2Shr         Bathy          Chla           SST
#      0.003633      0.015908      0.008286      0.085367      0.028847
# DaylightHours     MoonPhase        Region
#      0.009520      0.002969      0.093061
R> res2
#         Slope      Dist2Shr         Bathy          Chla           SST
#    -6.122e-05     2.449e-03    -4.082e-05     1.004e-02     3.367e-03
# DaylightHours     MoonPhase        Region
#     5.714e-04     6.735e-04     1.067e-02
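The code above differs from the failing call in the question in one respect: the factor conversion is done on the data frame columns, and the formula contains only bare column names instead of factor(Region). A minimal sketch of that pattern follows. It runs on synthetic data (the `toy` data frame and its values are invented for illustration), and uses a small `ntree` so the conditional importance finishes quickly, given that the run above took roughly 2768 seconds against about 6 seconds for the unconditional version.

```r
# Synthetic stand-in for the poster's data; the real CSV is not reproduced here.
set.seed(1)
n <- 200
toy <- data.frame(
  SB_Pres = rbinom(n, 1, 0.4),
  SST     = rnorm(n, mean = 25, sd = 2),
  Chla    = rexp(n, rate = 2),
  Region  = sample(c("north", "central", "south"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Convert categorical columns in the data frame itself, so the model
# formula needs no factor() wrappers.
toy$SB_Pres <- factor(toy$SB_Pres)
toy$Region  <- factor(toy$Region)

# Fit only when party is installed; 50 trees is an arbitrary small
# value chosen for speed, not a recommendation.
if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  fit <- cforest(SB_Pres ~ ., data = toy,
                 control = cforest_unbiased(ntree = 50, mtry = 2))
  imp <- varimp(fit, conditional = TRUE)
  print(imp)
}
```

Because the response here is random noise, the importances carry no meaning; the point is only that the call completes with a factor predictor when the conversion is done before fitting.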

Regarding "r - cforest varimp does not seem to work with categorical predictors", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20343974/
