python - How to detect overfitting with cross-validation: What should be the difference threshold?


Reposted by: 行者123  Updated: 2023-11-30 09:40:25

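After building a classification model, I evaluated it using accuracy, precision, and recall. To check for overfitting, I used K-Fold Cross Validation. I know that if my model's scores differ greatly from the cross-validation scores, my model is overfitting. However, I am stuck on how to define the threshold, that is, how large the difference in scores must be before I can infer that the model is overfitting. For example, here are 3 splits (3 Fold CV, shuffle= True, random_state= 42) and their respective scores for a Logistic Regression model: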

Split Number  1
Accuracy= 0.9454545454545454
Precision= 0.94375
Recall= 1.0

Split Number  2
Accuracy= 0.9757575757575757
Precision= 0.9753086419753086
Recall= 1.0

Split Number  3
Accuracy= 0.9695121951219512
Precision= 0.9691358024691358
Recall= 1.0

Training the Logistic Regression model directly, without CV:

Accuracy= 0.9530201342281879
Precision= 0.952054794520548
Recall= 1.0

So how do I decide how much my scores need to differ before I can infer overfitting?

Best answer

I assume you are using Cross-validation:

This splits your data into training and test sets.

You have probably implemented something like this:

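from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

# Load the iris data used as the example dataset
iris = datasets.load_iris()
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
# scores is a dict holding the per-fold test scores for each metric
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5)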

So right now you are only computing the test scores, and those are very good in all 3 splits.
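As a rough way to read the numbers from the question, the sketch below compares the single no-CV accuracy with the spread of the three fold accuracies. Using the fold-to-fold spread as a yardstick is an assumption made here for illustration, not a rule stated in the answer; the variable names are hypothetical.

```python
import statistics

# Accuracy values copied from the question.
fold_accuracy = [0.9454545454545454, 0.9757575757575757, 0.9695121951219512]
no_cv_accuracy = 0.9530201342281879

mean = statistics.mean(fold_accuracy)
spread = statistics.stdev(fold_accuracy)  # sample standard deviation

print(f"CV accuracy: mean={mean:.4f} stdev={spread:.4f}")
print(f"no-CV accuracy minus CV mean: {no_cv_accuracy - mean:+.4f}")

# True when the single score falls inside the range of the fold scores.
within_range = min(fold_accuracy) <= no_cv_accuracy <= max(fold_accuracy)
print(f"inside fold range: {within_range}")
```

If the single score sits inside the range the folds already cover, the difference alone says little about overfitting.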

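The first option is:

return_train_score is set to False by default to save computation time. To evaluate the scores on the training set as well you need to set it to True.

With that enabled you can also see the training scores of the folds. If you see an accuracy of 1.0 on the training set, that indicates overfitting.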
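A minimal sketch of this first option, assuming the iris data and the linear SVC from the snippet above as stand-ins for your own data and model. It prints the training and test accuracy of each fold side by side; the `gaps` name is only illustrative.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

# Iris is only a stand-in dataset (assumption); substitute your own X and y.
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=0)

# return_train_score=True adds 'train_score' next to 'test_score'.
scores = cross_validate(clf, X, y, scoring='accuracy', cv=5,
                        return_train_score=True)

# Per-fold gap between training and test accuracy.
gaps = scores['train_score'] - scores['test_score']
for fold, (train, test, gap) in enumerate(
        zip(scores['train_score'], scores['test_score'], gaps), start=1):
    print(f"fold {fold}: train={train:.3f} test={test:.3f} gap={gap:.3f}")
print(f"mean gap = {gaps.mean():.3f}")
```

A training score near 1.0 combined with a clearly lower test score is the pattern to look for.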

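The other option is: run more splits. If every test score still shows high accuracy, you can be confident that the algorithm is not overfitting and that you are doing well.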
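One possible way to run more splits is repeated stratified k-fold. The sketch below assumes synthetic data from `make_classification` and a `LogisticRegression` model as stand-ins; `RepeatedStratifiedKFold` is a choice made here and is not named in the answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
clf = LogisticRegression(max_iter=1000)

# 3 folds repeated 10 times with different shuffles gives 30 test scores.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=42)
test_scores = cross_val_score(clf, X, y, scoring='accuracy', cv=cv)

print(f"{len(test_scores)} scores: mean={test_scores.mean():.3f} "
      f"std={test_scores.std():.3f} min={test_scores.min():.3f}")
```

Looking at the minimum and the standard deviation, not only the mean, shows whether the high accuracy holds across all splits.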

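Did you add a baseline? I assume this is binary classification, and I suspect the dataset is highly imbalanced, so an accuracy of 0.96 may not be very good in general, because a dummy classifier (always predicting one class) would already reach an accuracy of 0.95.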
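A baseline of this kind can be sketched with scikit-learn's `DummyClassifier`. The data here is synthetic and deliberately imbalanced (roughly 95% in one class), which is an assumption chosen to mirror the guess above rather than the asker's real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (assumption): about 95% in class 1.
X, y = make_classification(n_samples=1000, weights=[0.05, 0.95],
                           flip_y=0, random_state=42)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')
model = LogisticRegression(max_iter=1000)

baseline_acc = cross_val_score(baseline, X, y, scoring='accuracy', cv=3)
model_acc = cross_val_score(model, X, y, scoring='accuracy', cv=3)

print(f"baseline accuracy = {baseline_acc.mean():.3f}")
print(f"model accuracy    = {model_acc.mean():.3f}")
```

The model's accuracy is only meaningful relative to the baseline, so on imbalanced data it is worth reporting both.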

Regarding "python - How to detect overfitting with cross-validation: What should be the difference threshold?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59063349/
