
r - Boosted regression trees - deviance values

Reposted. Author: 行者123. Updated: 2023-12-04 10:29:11

I am using the gbm package in R to fit a boosted regression tree (BRT) model of the following form:

Height above ground ~ Age + Season + Habitat + Time

Height above ground (HAG) is a continuous variable, and so is time. Season and habitat are binomial (two-level) variables.

I am getting very high deviance values and I don't know why.
Can someone help me set the parameters?

> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+ family = "gaussian", tree.complexity = 4,
+ learning.rate = 0.01, bag.fraction = 0.50,
+ tolerance.method = "fixed",
+ tolerance = 0.01)


GBM STEP - version 2.9

Performing cross-validation optimisation of a boosted regression tree model
for HAG and using a family of gaussian
Using 15439 observations and 4 predictors
creating 10 initial models of 50 trees

folds are unstratified
total mean deviance = 55368.22
tolerance is fixed at 0.01
ntrees resid. dev.
50 51050.65
now adding trees...
100 48935.65
150 47805.14
200 47193.43
250 46841.71
300 46631.33
350 46498.56
400 46418.58
450 46371.7
500 46336.54
550 46317.53
600 46309.25
650 46300.57
700 46296.82
750 46297
800 46299.11
850 46297.7
900 46298.34
950 46292.32
1000 46297.62
1050 46295.78
1100 46301.32
1150 46306.59
1200 46312.55
1250 46314.67
1300 46318.64
1350 46321.38
1400 46324.33
1450 46322.9
fitting final gbm model with a fixed number of 950 trees for HAG

mean total deviance = 55368.21
mean residual deviance = 45913.34

estimated cv deviance = 46292.32 ; se = 1366.501

training data correlation = 0.413
cv correlation = 0.406 ; se = 0.008

elapsed time - 0.02 minutes

Best answer

The deviance reported by gbm here is the mean squared error, so its size depends on the scale of your dependent variable.

For example:

library(dismo)
library(mlbench)
data(BostonHousing)
idx=sample(nrow(BostonHousing),400)
TrnData = BostonHousing[idx,]
TestData = BostonHousing[-idx,]

The dependent variable is the last column, "medv", so we run gbm on the original data:
gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")

mean total deviance = 84.02
mean residual deviance = 7.871

estimated cv deviance = 13.959 ; se = 1.909

training data correlation = 0.952
cv correlation = 0.916 ; se = 0.012

You can see that the mean residual deviance can also be computed from your residuals (i.e. y - predicted y):
mean(gbm_0$residuals^2)
[1] 7.871158
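
The same arithmetic can be reproduced without gbm. The sketch below is only an illustration under stated assumptions: it uses simulated data and a plain lm fit as a stand-in for the boosted model, and the helper mean_sq_dev is a hypothetical name, not part of dismo or gbm. It computes the mean total deviance as the mean squared deviation of the response from its own mean, and the mean residual deviance as the mean squared residual.

```r
# Hypothetical helper: mean squared deviation between observed and fitted values
mean_sq_dev <- function(obs, fitted) mean((obs - fitted)^2)

set.seed(1)
n <- 200
x <- runif(n)
y <- 10 + 5 * x + rnorm(n)   # simulated response (assumption, not the asker's data)

fit <- lm(y ~ x)             # stand-in model, used here instead of gbm.step

total_dev    <- mean_sq_dev(y, mean(y))      # deviance of the intercept-only prediction
residual_dev <- mean_sq_dev(y, fitted(fit))  # same as mean(residuals(fit)^2)

c(total = total_dev, residual = residual_dev)
```

Both quantities are in squared units of the response, which is why a response measured on a large scale produces large deviance values even when the model fits reasonably.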

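It is always good to use test data (data the model was not trained on). You can also check how close the predictions are to the actual data using the correlation or the MAE (mean absolute error):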
pred = predict(gbm_0,TestData,1000)    
# or pearson if you like
cor(pred,TestData$medv,method="spearman")
[1] 0.8652737
# MAE
mean(abs(TestData$medv-pred))
[1] 2.75325

Visualising it, the good correlation makes sense, and your predictions are off by about 3 on average.

[Figure omitted in this copy]
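
Because a Gaussian deviance is a mean squared error, its square root is back in the units of the response, which is often easier to judge than the raw deviance. A minimal sketch using only numbers printed in the outputs above (the variable names are illustrative, not objects returned by gbm.step):

```r
# Values copied from the gbm.step outputs shown above
cv_dev_boston <- 13.959     # estimated cv deviance, Boston housing model
cv_dev_hag    <- 46292.32   # estimated cv deviance, model from the question

# Square root of a mean squared error is in the units of the response
rmse_boston <- sqrt(cv_dev_boston)
rmse_hag    <- sqrt(cv_dev_hag)

round(c(boston = rmse_boston, hag = rmse_hag), 2)
```

Whether an error of that size is large depends on the typical range of the response, which the deviance alone does not tell you.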

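Now, if you change the scale of your dependent variable, the deviance changes, but your interpretation based on the correlation or the MAE stays the same: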
TrnData$medv = TrnData$medv*2
TestData$medv = TestData$medv*2
gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")

mean total deviance = 336.081
mean residual deviance = 30.983

estimated cv deviance = 57.52 ; se = 10.254

training data correlation = 0.953
cv correlation = 0.911 ; se = 0.019

elapsed time - 0.2 minutes

pred = predict(gbm_2,TestData,1000)
cor(pred,TestData$medv,method="spearman")
[1] 0.8676821
mean(abs(TestData$medv-pred))
[1] 5.47673
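
The effect of rescaling can be checked with plain arithmetic on the numbers reported above. The sketch below assumes nothing beyond those printed values; the helper dev_explained is a hypothetical name, and "proportion of deviance explained" (1 - residual / total) is offered here as one scale-free way to read the output, not as something gbm.step prints itself.

```r
# Values copied from the gbm.step outputs shown above
boston <- c(total = 84.02,    residual = 7.871)     # original medv
scaled <- c(total = 336.081,  residual = 30.983)    # medv multiplied by 2
hag    <- c(total = 55368.21, residual = 45913.34)  # model from the question

# Doubling the response multiplies a mean squared error by about 2^2 = 4
# (not exactly for the residual deviance, since each gbm.step run is stochastic)
scaled / boston

# Proportion of deviance explained: 1 - residual / total, unaffected by rescaling
dev_explained <- function(d) unname(1 - d["residual"] / d["total"])
sapply(list(boston = boston, scaled = scaled, hag = hag), dev_explained)
```

Read this way, the two Boston models explain roughly 91% of the deviance regardless of scale, while the model in the question explains roughly 17%, which is in line with its training correlation of 0.413. The large absolute deviance there reflects the scale of the response as well as a modest fit.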

This question, r - Boosted regression trees - deviance values, corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/60488587/
