
python - High AUC but poor predictions with imbalanced data


I am trying to build a classifier with LightGBM on a very imbalanced dataset. The imbalance ratio is roughly 97:3, i.e.:

Class
0    0.970691
1    0.029309

The parameters and the training code I use are shown below.

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because training data is unbalanced (replaced with scale_pos_weight)
    'num_leaves': 31,  # we should let it be smaller than 2^(max_depth)
    'max_depth': 6,  # -1 means no limit
    'subsample': 0.78
}

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)

nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)


preds = model.predict(test_feats)

preds = [1 if x >= 0.5 else 0 for x in preds]

I run CV to get the best model and the best number of rounds. I got 0.994 AUC in CV and a similar score on the validation set.

But when I predict on the test set, the results are very poor. I am sure the training set is sampled perfectly.

Which parameters need to be tuned? What is the cause of the problem? Should I resample the dataset to reduce the majority class?

Best Answer

The problem is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding on the final hard classification:

preds = [1 if x >= 0.5 else 0 for x in preds]

This should not be the case here.
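As a first step, the hard-coded 0.5 can be made an explicit parameter; a minimal sketch (the value 0.35 below is purely illustrative, not a recommendation):

threshold = 0.35  # illustrative value only; choose it on held-out data, not arbitrarily
preds = [1 if x >= threshold else 0 for x in preds]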

This is quite a big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...

From a relevant answer at Cross Validated (emphasis added):

Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater than 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
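To make this concrete, here is a minimal sketch of choosing a threshold from the ROC curve with scikit-learn; y_val and val_preds (held-out labels and predicted probabilities) are hypothetical names, and Youden's J statistic is just one common selection criterion, not the only option:

import numpy as np
from sklearn.metrics import roc_curve

# ROC curve on a held-out validation set, not on the training data
fpr, tpr, thresholds = roc_curve(y_val, val_preds)

# Youden's J statistic = TPR - FPR; take the threshold that maximizes it
best_threshold = thresholds[np.argmax(tpr - fpr)]

# use this threshold instead of 0.5 for the final hard predictions
preds = [1 if x >= best_threshold else 0 for x in model.predict(test_feats)]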

From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:

2.2. How to set the classification threshold for the testing set

Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
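As one way to do what the paper describes, the threshold can be chosen on validation data with respect to a metric that tolerates the imbalance, such as F1; a minimal sketch, again with hypothetical y_val / val_preds:

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, val_preds)

# F1 for every candidate threshold (precision/recall have one extra trailing element)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]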

The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.

The take-home lesson from all of the above: AUC is seldom enough, but the ROC curve itself is often your best friend...


On a more general level, regarding the role of the threshold itself in the classification process (which, at least in my experience, many practitioners get wrong), check also the Classification probability threshold thread at Cross Validated (and the links provided there); the key point:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
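To illustrate that decision component: if the predicted probabilities are reasonably calibrated and you can attach costs to the two kinds of error, the threshold follows from those costs rather than from the statistics; a minimal sketch with purely illustrative cost values:

# illustrative assumption: a missed positive (false negative) costs 20x a false alarm
cost_fp = 1.0
cost_fn = 20.0

# Bayes-optimal cut-off for calibrated probabilities under these costs
threshold = cost_fp / (cost_fp + cost_fn)  # = 1/21, roughly 0.048

preds = [1 if x >= threshold else 0 for x in preds]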

For python - High AUC but poor predictions with imbalanced data, see the related question on Stack Overflow: https://stackoverflow.com/questions/51190809/
