
python - Tuning parameters with GridSearchCV does not give the best result

Reposted. Author: 行者123. Updated: 2023-11-30 09:02:53

I am trying to tune the parameters of a gradient boosting regressor.

First, considering only n_estimators, I used the staged_predict method to find the best n_estimators and got RMSE = 4.84.

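import numpy as np
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gbr_onehot = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    random_state=214
)
model = gbr_onehot.fit(X_train, y_train)

# Test-set MSE after each boosting stage (1, 2, ..., 1000 trees)
errors = [mean_squared_error(y_test, y_pred)
          for y_pred in gbr_onehot.staged_predict(X_test)]

# Note: argmin returns a 0-based stage index, so the lowest error
# is actually reached with best_num_trees + 1 trees
best_num_trees = np.argmin(errors)

GBR_best_num_trees_onehot = GradientBoostingRegressor(
    n_estimators=best_num_trees,
    learning_rate=0.1,
    random_state=214
)

best_num_tree_model = GBR_best_num_trees_onehot.fit(X_train, y_train)
y_pred = GBR_best_num_trees_onehot.predict(X_test)
print(best_num_trees)
print(f'RMSE with label encoding (best_num_trees) = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')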


>>> 596
>>> RMSE with label encoding (best_num_trees) = 4.849497587420823
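The staged_predict selection above can also be done against a separate validation split, so that the test set stays untouched until the final score. The following is only a minimal sketch under assumptions: the synthetic data from make_regression stands in for the question's X and y, and the names X_val, val_errors and best_n_trees are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic data replaces the question's X and y.
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)

# Three-way split: train / validation (to choose the tree count) / test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=214)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbr.staged_predict(X_val)]

# argmin is a 0-based index, so the matching tree count is index + 1.
best_n_trees = int(np.argmin(val_errors)) + 1

# Test predictions using only the first best_n_trees stages of the fitted model.
test_pred = list(gbr.staged_predict(X_test))[best_n_trees - 1]
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
print(best_n_trees, test_rmse)
```

Because the tree count is chosen on X_val, the RMSE reported on X_test is not biased by the selection step.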

Alternatively, this time I used GridSearchCV to tune n_estimators, learning_rate, and the max_depth of each tree.

First, tuning n_estimators and learning_rate:

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def rmse(actual, predict):
    """Root mean squared error between actual and predicted values."""
    predict = np.array(predict)
    actual = np.array(actual)
    distance = predict - actual
    square_distance = distance ** 2
    mean_square_distance = square_distance.mean()
    score = np.sqrt(mean_square_distance)
    return score

# greater_is_better=False makes the scorer return the negated RMSE
rmse_score = make_scorer(rmse, greater_is_better=False)

p_test = {
    'learning_rate': [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
    'n_estimators': [100, 250, 500, 750, 1000, 1250, 1500, 1750]
}

# The iid argument was removed from GridSearchCV in scikit-learn 0.24,
# so it is omitted here
tuning = GridSearchCV(estimator=GradientBoostingRegressor(max_depth=3,
                                                          min_samples_split=2,
                                                          min_samples_leaf=1,
                                                          subsample=1.0,
                                                          max_features='sqrt',
                                                          random_state=214),
                      param_grid=p_test,
                      scoring=rmse_score,
                      n_jobs=4,
                      cv=5)

tuning.fit(X_train, y_train)

Then, using the values from tuning.best_params_, I ran a second search:

p_test_2 = {'max_depth': [2, 3, 4, 5, 6, 7]}

# The iid argument was removed from GridSearchCV in scikit-learn 0.24,
# so it is omitted here
tuning = GridSearchCV(estimator=GradientBoostingRegressor(learning_rate=0.05,
                                                          n_estimators=1000,
                                                          min_samples_split=2,
                                                          min_samples_leaf=1,
                                                          max_features='sqrt',
                                                          random_state=214),
                      param_grid=p_test_2,
                      scoring=rmse_score,
                      n_jobs=4,
                      cv=5)

tuning.fit(X_train, y_train)

This second search is used to find the best max_depth parameter.

After plugging in the parameters obtained above and testing:

model = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=1000,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=214,
    max_depth=3
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f'RMSE = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')

>>> RMSE = 4.876534569535954

This RMSE is higher than the one I got using staged_predict alone. Why is that? Also, when I print(tuning.best_score_), why does it return a negative value?

Best answer

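It's simple. You found the best-fit parameters on the training data, but you are comparing RMSE on the test data. These are different datasets, so the quality values differ. If you computed RMSE on the training data, the regressor with the best-fit parameters would come out better. Note also that your first approach picked n_estimators by minimizing the error on the test set itself, so its test RMSE is optimistic by construction.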
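On the negative best_score_: scikit-learn scorers follow a higher-is-better convention, so a scorer built with make_scorer(..., greater_is_better=False) returns the negated metric, and best_score_ is the negated mean cross-validated RMSE. The sketch below illustrates this with the built-in "neg_root_mean_squared_error" scorer and compares the cross-validated RMSE with the test RMSE. The synthetic data and the small grid are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic data replaces the question's X and y.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=214),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_root_mean_squared_error",  # negated RMSE: higher is better
    cv=5,
)
search.fit(X_train, y_train)

# best_score_ is negative; flip the sign to read it as an RMSE.
cv_rmse = -search.best_score_

# search.predict uses the best estimator refitted on all of X_train.
test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))

print(f"best_score_ = {search.best_score_:.3f}")
print(f"CV RMSE = {cv_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

The two RMSE values are measured on different data (folds of X_train versus X_test), so they are not expected to match.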

[Update]

For a better understanding, take a look at the picture: [image not preserved]

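Here, model complexity corresponds to some of your tuning parameters (max depth, etc.), prediction error is the analogue of your RMSE measure, and the two curves correspond to your training and test datasets. So when you search for the best-fit parameters with GridSearchCV, you move down along the training curve and end up with an RMSE value near the high-complexity end. That is dangerous because of overfitting, and the RMSE on the test sample is not optimal there.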
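The two curves described in the answer can be approximated numerically with validation_curve, which reports training-fold and held-out-fold scores for each value of a complexity parameter. This is a rough sketch under assumptions: synthetic data, max_depth as the complexity axis, and a fixed n_estimators=100 are all choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import validation_curve

# Assumption: synthetic data replaces the question's X and y.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
depths = [1, 2, 4, 6, 8]

# One row of scores per max_depth value, one column per CV fold.
train_scores, val_scores = validation_curve(
    GradientBoostingRegressor(n_estimators=100, random_state=214),
    X, y,
    param_name="max_depth",
    param_range=depths,
    scoring="neg_root_mean_squared_error",
    cv=5,
)

# Scores are negated RMSE values; flip the sign and average over folds.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_rmse, val_rmse):
    print(f"max_depth={d}: train RMSE={tr:.2f}, held-out RMSE={va:.2f}")
```

Training RMSE keeps falling as max_depth grows, while the held-out RMSE does not have to follow it, which is the gap the answer attributes to overfitting.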

Regarding "python - Tuning parameters with GridSearchCV does not give the best result", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59607441/
