
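python - GridSearchCV with separate training and validation sets erroneously also takes the training results into account when finally choosing the best model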

Reposted. Author: 行者123. Updated: 2023-11-30 09:05:48

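I have a dataset of 3500 observations x 70 features, which is my training set, and a dataset of 600 observations x 70 features, which is my validation set. The goal is to correctly classify each observation as either 0 or 1.

I use Xgboost, and my goal is to achieve the highest possible precision at a classification threshold of 0.5.

I am carrying out a grid search: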

import numpy as np
import pandas as pd
import xgboost

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')

# Specify 'data_valid' as the validation set for the grid search below
from sklearn.model_selection import PredefinedSplit
X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)

# Define my own scoring function to see
# if it is called for both the training and the validation sets
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True)

# Instantiate xgboost
from xgboost.sklearn import XGBClassifier
classifier = XGBClassifier(random_state=0)

# Small parameters' grid ONLY FOR START
# I plan to use way bigger parameters' grids
parameters = {'n_estimators': [150, 175, 200]}

# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
                                cv=train_valid_merge_indices, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)

...................................................... ................................

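train_valid_merge - Specifying my own validation set:

I want to train every model on my training set (data_train) and do the hyperparameter tuning on my distinct/separate validation set (data_valid). For this reason I define a function called train_valid_merge, which concatenates my training and validation sets so that they can be fed to GridSearchCV, and I also use PredefinedSplit to specify which part of this merged set is the training set and which is the validation set: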

def train_valid_merge(data_train, data_valid):

    # Set test_fold values to -1 for training observations
    train_indices = [-1]*len(data_train)

    # Set test_fold values to 0 for validation observations
    valid_indices = [0]*len(data_valid)

    # Concatenate the indices for the training and validation sets
    train_valid_indices = train_indices + valid_indices

    # Concatenate data_train & data_valid
    import pandas as pd
    data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y, train_valid_indices
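
As a quick sanity check of the test_fold convention used here, the following minimal sketch builds a PredefinedSplit for a toy case and lists the single split it produces. The sizes (7 training rows and 3 validation rows) are hypothetical and chosen only to keep the output readable; they stand in for the 3500/600 rows of the question.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical toy sizes, standing in for the 3500 / 600 rows of the question
n_train, n_valid = 7, 3

# -1 marks rows that are never used for validation, 0 marks the single validation fold
test_fold = [-1] * n_train + [0] * n_valid

ps = PredefinedSplit(test_fold=test_fold)
splits = list(ps.split())
train_idx, valid_idx = splits[0]

print(ps.get_n_splits())  # number of (train, validation) splits
print(train_idx)          # positions of the training rows in the merged set
print(valid_idx)          # positions of the validation rows in the merged set
```

With this encoding there is exactly one split: the rows marked -1 always stay on the training side, and the rows marked 0 form the only validation fold.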

...................................................... ................................

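custom_scorer - Specifying my own scoring metric:

I define my own scoring function, which simply returns the precision, just to see whether it is called for both the training and the validation set: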

def my_precision(y_true, y_predict):

    # Check length of 'y_true' to see if it is the training or the validation set
    print(len(y_true))

    # Calculate precision
    from sklearn.metrics import precision_score
    precision = precision_score(y_true, y_predict, average='binary')

    return precision

...................................................... ................................

When I run the whole thing (for parameters = {'n_estimators': [150, 175, 200]}), the print(len(y_true)) call in the my_precision function prints the following:

600
600
3500
600
3500
3500

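This means that the scoring function is called for both the training and the validation set. But I have tested that the scoring function is not merely called: the results from both the training and the validation set are also used to determine the best model of the grid search (even though I have specified that only the validation set results should be used).

For example, with our 3 parameter values ('n_estimators': [150, 175, 200]), it takes into account the scores for both the training and the validation set (2 sets), so it produces (3 parameter values) x (2 sets) = 6 different grid results. It therefore picks the best hyperparameter set out of all these grid results, so it may end up picking one based on the training set results, whereas I want only the validation set results (3 results) to be taken into account.

However, if I add something like the following to the my_precision function to circumvent the training set (by setting all of its precision values to 0):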

# Remember that the training set has 3500 observations
# and the validation set 600 observations
if len(y_true) > 600:
    return 0

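Then (as far as I have tested) I do get the best model according to my specification, because the training set precision results are too small to be selected, since they are all 0.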
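
A note on the length check in this workaround: the placement of the parenthesis matters. The sketch below uses zero-filled NumPy arrays as hypothetical stand-ins for the label vectors and compares the two spellings. Taking len() of the comparison result gives the length of a boolean array, which is truthy for any non-empty input, whereas comparing len() of the array with 600 actually separates the two sets.

```python
import numpy as np

# Hypothetical stand-ins for the label vectors of the two sets
y_valid = np.zeros(600, dtype=int)
y_train = np.zeros(3500, dtype=int)

# Length of the element-wise comparison: a boolean array with 600 entries
length_of_mask = len(y_valid > 600)
print(length_of_mask)        # 600
print(bool(length_of_mask))  # True, so an `if` on it would always fire

# Comparison of the length itself: distinguishes the two sets
print(len(y_valid) > 600)    # False
print(len(y_train) > 600)    # True
```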

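My questions are the following:

Why does the custom scoring function take both the training and the validation set into account when picking the best model, even though I have specified with train_valid_merge_indices that the best model of the grid search should be selected based on the validation set only?

How can I make GridSearchCV take into account only the validation set and the models' scores on it when the models are selected and ranked?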
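
One way to check which scores take part in the ranking is to inspect cv_results_ after fitting. The sketch below is not the setup from the question: it assumes synthetic data from make_classification, a DecisionTreeClassifier as a lightweight stand-in for XGBClassifier, and the built-in 'precision' scorer. As far as I know, in current scikit-learn releases training scores are only computed when return_train_score=True (older releases computed them by default), they are stored separately from the validation scores, and best_index_ is taken from rank_test_score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): first 300 rows for training, last 100 for validation
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
test_fold = [-1] * 300 + [0] * 100

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={'max_depth': [1, 3, 5]},
    scoring='precision',
    cv=PredefinedSplit(test_fold=test_fold),
    return_train_score=True,  # the scorer is then also called on the training rows
    refit=True,
)
grid.fit(X, y)

results = grid.cv_results_
print(results['mean_train_score'])  # recorded for information only
print(results['mean_test_score'])   # validation scores
print(results['rank_test_score'])   # ranking derived from the validation scores
print(grid.best_index_ == int(np.argmin(results['rank_test_score'])))
```

Setting return_train_score=False skips the training-set scoring entirely, so the scorer would then only see the validation rows.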

Best answer

I have one distinct training set and one distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on my distinct validation set.

Then you certainly need neither PredefinedSplit nor GridSearchCV:

import pandas as pd
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import precision_score

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')

# training data & labels:
X = data_train.iloc[:, :-1].values
y = data_train.iloc[:, -1].values

# validation data & labels:
X_valid = data_valid.iloc[:, :-1].values
y_true = data_valid.iloc[:, -1].values

n_estimators = [150, 175, 200]
perf = []

for k_estimators in n_estimators:
    clf = XGBClassifier(n_estimators=k_estimators, random_state=0)
    clf.fit(X, y)

    y_predict = clf.predict(X_valid)
    precision = precision_score(y_true, y_predict, average='binary')
    perf.append(precision)

perf will contain the performance of the respective classifiers on the validation set...
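
For larger grids, the same manual loop can be driven by sklearn.model_selection.ParameterGrid, and the best candidate can be read off with argmax. The following is only a sketch under stated assumptions: the data is synthetic, GradientBoostingClassifier stands in for XGBClassifier so that the example depends on scikit-learn alone, and the grid values are hypothetical and kept small for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import ParameterGrid

# Synthetic data (assumption): 300 training rows, 100 validation rows
X_all, y_all = make_classification(n_samples=400, n_features=10, random_state=0)
X, y = X_all[:300], y_all[:300]
X_valid, y_true = X_all[300:], y_all[300:]

# Hypothetical grid; every combination becomes one candidate dict
candidates = list(ParameterGrid({'n_estimators': [10, 20, 30], 'max_depth': [2, 3]}))

perf = []
for params in candidates:
    # Train on the training set only
    clf = GradientBoostingClassifier(random_state=0, **params)
    clf.fit(X, y)

    # Score on the separate validation set only
    y_predict = clf.predict(X_valid)
    perf.append(precision_score(y_true, y_predict, average='binary'))

# Pick the candidate with the highest validation precision
best_index = int(np.argmax(perf))
best_params = candidates[best_index]
best_score = perf[best_index]
print(best_params, best_score)
```

Because the training rows are never scored in this loop, the selection depends on the validation precision alone.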

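Regarding python - GridSearchCV with separate training and validation sets erroneously also takes the training results into account when finally choosing the best model, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52579293/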
