
python - KFolds Cross Validation vs train_test_split


I built my first random forest classifier today and I am trying to improve its performance. I have been reading about how important cross-validation is for avoiding overfitting and therefore getting better results. I implemented StratifiedKFold with sklearn, but, surprisingly, this approach turned out to be less accurate. I have read many posts suggesting that cross-validating is much more efficient than train_test_split.

Estimator:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

from sklearn.model_selection import StratifiedKFold

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    # index into the feature/label arrays for each of the 10 stratified folds
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS (train_test_split):

from sklearn.model_selection import train_test_split

train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)

Here are the results:

CV:

AUROC:  0.74
Accuracy Score: 74.74 %.
Specificity: 0.69
Precision: 0.75
Sensitivity: 0.79
Matthews correlation coefficient (MCC): 0.49
F1 Score: 0.77

TTS:

AUROC:  0.76
Accuracy Score: 76.23 %.
Specificity: 0.77
Precision: 0.79
Sensitivity: 0.76
Matthews correlation coefficient (MCC): 0.52
F1 Score: 0.77
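
For reference, metrics like the ones above could be computed roughly as follows. This is a minimal sketch only: the question does not show the scoring code, so the variable names and the specificity calculation via the confusion matrix are assumptions.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

# Fit on the training split and score on the held-out split.
rf.fit(train_feature, train_label)
pred = rf.predict(test_feature)
proba = rf.predict_proba(test_feature)[:, 1]  # probability of the positive class

# Specificity is not a built-in scorer, so derive it from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(test_label, pred).ravel()

print("AUROC: ", round(roc_auc_score(test_label, proba), 2))
print("Accuracy Score:", round(accuracy_score(test_label, pred) * 100, 2), "%.")
print("Specificity:", round(tn / (tn + fp), 2))
print("Precision:", round(precision_score(test_label, pred), 2))
print("Sensitivity:", round(recall_score(test_label, pred), 2))
print("Matthews correlation coefficient (MCC):", round(matthews_corrcoef(test_label, pred), 2))
print("F1 Score:", round(f1_score(test_label, pred), 2))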

Is this really possible? Or have I set up my model incorrectly?

Also, is this the correct way to use cross-validation?

Best Answer

It's good to see that you have been reading up on this!

The reason for the difference is that the TTS approach introduces bias (since you are not using all of your observations for testing), and that explains the discrepancy.

In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

The results can also vary a lot:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

Cross-validation deals with this problem by using all of the available data and thereby eliminating the bias.
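
To get a single cross-validated number to compare against the TTS result, you can let scikit-learn run all ten folds and average the per-fold scores. A minimal sketch, assuming features, labels and rf are the objects from the question and using accuracy as the example metric:

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(rf, features, labels, cv=cv, scoring="accuracy")

# The mean over the 10 folds is the cross-validated estimate of the accuracy;
# the standard deviation shows how much it varies from fold to fold.
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy: %.2f %% (+/- %.2f)" % (scores.mean() * 100, scores.std() * 100))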

Here, your TTS result carries more bias, and you should keep that in mind when analysing the results. Perhaps you also got lucky with the sampling of the test/validation set. You can check this yourself with the sketch below.
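
A quick way to see the "luck" factor is to repeat the single 80/20 split with different random seeds and watch how much the score moves around. A minimal sketch, assuming the same features, labels and rf as above; the seed values are arbitrary:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Repeat the 80/20 split with different seeds; the spread in accuracy is the
# variability of the single validation-set estimate described above.
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed)
    rf.fit(X_tr, y_tr)
    print("random_state=%d -> accuracy %.3f" % (seed, accuracy_score(y_te, rf.predict(X_te))))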

Again, here is a great, beginner-friendly article that covers the topic in detail: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, see the "Model Assessment and Selection" chapter here (the source of the quoted passages):

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

The original question, "python - KFolds Cross Validation vs train_test_split", is on Stack Overflow: https://stackoverflow.com/questions/49134338/
