
python - Why can't I get the same results as GridSearchCV?


GridSearchCV only returns a score for each parametrization, and I would also like to see an ROC curve to better understand the results. To do this, I want to take the best-performing model from GridSearchCV and reproduce those same results, but cache the probabilities. Here is my code.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tqdm import tqdm

import warnings
warnings.simplefilter("ignore")

data = make_classification(n_samples=100, n_features=20, n_classes=2,
                           random_state=1, class_sep=0.1)
X, y = data


small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100))),
    ('clf', LogisticRegression())
])

params = {
    'clf__class_weight': ['balanced'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}
key_feats = ['mean_train_score', 'mean_test_score', 'param_clf__C',
             'param_clf__penalty', 'param_rfs__max_features']

skf = StratifiedKFold(n_splits=5, random_state=0)

all_results = list()
for _ in tqdm(range(25)):
    gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
    gs.fit(X, y)
    results = pd.DataFrame(gs.cv_results_)[key_feats]
    all_results.append(results)


param_group = ['param_clf__C', 'param_clf__penalty', 'param_rfs__max_features']
all_results_df = pd.concat(all_results)
all_results_df.groupby(param_group).agg(['mean', 'std']
    ).sort_values(('mean_test_score', 'mean'), ascending=False).head(20)

Here is my attempt to reproduce the results:

small_pipe_w_params = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=3)),
    ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=0.1))
])
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
for _ in range(25):
    scores = list()
    for train, test in skf.split(X, y):
        small_pipe_w_params.fit(X[train, :], y[train])
        probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
        # cache probas here to build an ROC w/ conf interval later
        scores.append(roc_auc_score(y[test], probas))
    all_scores.extend(scores)

print('mean: {:<1.3f}, std: {:<1.3f}'.format(np.mean(all_scores), np.std(all_scores)))

I run the above several times because the results seem unstable. I created a deliberately challenging dataset, since my own dataset is similarly hard to learn. The groupby is meant to take all iterations of GridSearchCV and average the train and test scores (and take their standard deviations) to stabilize the results. I then pick out the best-performing model (in my most recent run: C=0.1, penalty=l2 and max_features=3) and try to reproduce those same results when I plug in those parameters deliberately.
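For reference, a minimal sketch of how the top parametrization can be read off the aggregated frame programmatically rather than by eye (assuming the all_results_df and param_group defined above):

summary = all_results_df.groupby(param_group).agg(['mean', 'std'])
summary = summary.sort_values(('mean_test_score', 'mean'), ascending=False)
# The group keys follow param_group: (C, penalty, max_features)
best_C, best_penalty, best_max_features = summary.index[0]
print(best_C, best_penalty, best_max_features)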

The GridSearchCV model produces a mean ROC score of 0.63 with a std of 0.042, while my own implementation gets a mean of 0.59 with a std of 0.131. The grid search scores are noticeably better. The results are similar if I run 100 iterations of both GSCV and my own experiment.

Why aren't these results the same? They both use StratifiedKFold() internally when an integer is supplied for cv... and maybe GridSearchCV weights the scores by fold size? I'm not sure about that, but it would make sense. Is my implementation flawed?

Edit: random_state added to StratifiedKFold

Best answer

If you set the random_state of the RandomForestClassifier, the variation between different GridSearchCV runs will be eliminated.

To simplify, I set n_estimators=10 and got the following results:

                                                        mean_train_score           mean_test_score
                                                                    mean       std            mean  std
param_clf__C param_clf__penalty param_rfs__max_features
1.0          l2                 5                               0.766701  0.000000        0.580727  0.0
                                10                              0.768849  0.000000        0.577737  0.0
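A minimal sketch of how such a table can be produced, assuming the grid from the question is rerun with the random sources pinned (random_state=0 is an assumption here; the answer only states that random_state was set, and uses 0 in the reproduction further below):

small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=0))),
    ('clf', LogisticRegression(random_state=0))
])
gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
gs.fit(X, y)
# keep every column of cv_results_ (no key_feats filter) so the per-split scores are available
all_results_df = pd.DataFrame(gs.cv_results_)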

Now, to look at the per-split performance of the best hyperparameters (with the key_feats filtering removed so that the per-split scores are kept), use

all_results_df.sort_values(('mean_test_score'), ascending=False).head(1).T

and we get

    16
mean_fit_time 0.228381
mean_score_time 0.113187
mean_test_score 0.580727
mean_train_score 0.766701
param_clf__C 1
param_clf__class_weight balanced
param_clf__penalty l2
param_rfs__max_features 5
params {'clf__class_weight': 'balanced', 'clf__penalt...
rank_test_score 1
split0_test_score 0.427273
split0_train_score 0.807051
split1_test_score 0.47
split1_train_score 0.791745
split2_test_score 0.54
split2_train_score 0.789243
split3_test_score 0.78
split3_train_score 0.769856
split4_test_score 0.7
split4_train_score 0.67561
std_fit_time 0.00586908
std_score_time 0.00152781
std_test_score 0.13555
std_train_score 0.0470554

Let's reproduce this!

skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()

scores = []
weights = []


for train, test in skf.split(X, y):
    small_pipe_w_params = Pipeline([
        ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10,
                                                       random_state=0), max_features=5)),
        ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=1.0, random_state=0))
    ])
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])
    # cache probas here to build an ROC w/ conf interval later
    scores.append(roc_auc_score(y[test], probas[:, 1]))
    weights.append(len(test))

print(scores)
print('mean: {:<1.6f}, std: {:<1.3f}'.format(np.average(scores, axis=0, weights=weights), np.std(scores)))

[0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
mean: 0.580727, std: 0.135

Note: mean_test_score is not just a simple mean, it is a sample-weighted mean. The reason is the iid param.

From the documentation:

iid : boolean, default=’warn’ If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds. If False, return the average score across folds. Default is True, but will change to False in version 0.21, to correspond to the standard definition of cross-validation.

Changed in version 0.20: Parameter iid will change from True to False by default in version 0.22, and will be removed in 0.24.
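In code, iid=True amounts to a sample-weighted average of the per-split scores, while a plain mean corresponds to iid=False. A small illustration with hypothetical fold sizes (the actual test-fold sizes are not listed in the answer; note that the plain mean of the five split scores above is about 0.583, not the reported 0.580727, which is exactly this weighting at work):

import numpy as np

# Hypothetical per-fold test scores and test-set sizes (illustration only).
fold_scores = np.array([0.60, 0.70, 0.80])
fold_sizes = np.array([10, 20, 40])

plain_mean = fold_scores.mean()                              # iid=False behaviour
weighted_mean = np.average(fold_scores, weights=fold_sizes)  # iid=True behaviour

print(plain_mean)     # ≈ 0.70
print(weighted_mean)  # (0.6*10 + 0.7*20 + 0.8*40) / 70 ≈ 0.743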

For python - Why can't I get the same results as GridSearchCV?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55717820/
