python - Test accuracy is always high no matter how small my training set is


Reposted by: 行者123 · Updated: 2023-11-30 10:00:46


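I am working on a project that tries to classify comments into several categories: "toxic", "severe toxic", "obscene", "threat", "insult", and "identity hate". The dataset comes from a Kaggle challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge . The problem I am facing is that no matter how small the training set I fit on, my accuracy when predicting labels for the test data is always around 90% or higher. In this example I am training on 15 rows and testing on 159,556 rows. Normally I would be happy to get high test accuracy, but in this case I feel I am doing something wrong.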

I am reading the data into a pandas DataFrame:

import pandas as pd
trainData = pd.read_csv('train.csv')

This is what the data looks like when printed:

                      id                                       comment_text  \
0       0000997932d777bf  Explanation\nWhy the edits made under my usern...   
1       000103f0d9cfb60f  D'aww! He matches this background colour I'm s...   
2       000113f07ec002fd  Hey man, I'm really not trying to edit war. It...   
3       0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...   
4       0001d958c54c6e35  You, sir, are my hero. Any chance you remember...   
...                  ...                                                ...   
159566  ffe987279560d7ff  ":::::And for the second time of asking, when ...   
159567  ffea4adeee384e90  You should be ashamed of yourself \n\nThat is ...   
159568  ffee36eab5c267c9  Spitzer \n\nUmm, theres no actual article for ...   
159569  fff125370e4aaaf3  And it looks like it was actually you who put ...   
159570  fff46fc426af1f9a  "\nAnd ... I really don't think you understand...   

        toxic  severe_toxic  obscene  threat  insult  identity_hate  
0           0             0        0       0       0              0  
1           0             0        0       0       0              0  
2           0             0        0       0       0              0  
3           0             0        0       0       0              0  
4           0             0        0       0       0              0  
...       ...           ...      ...     ...     ...            ...  
159566      0             0        0       0       0              0  
159567      0             0        0       0       0              0  
159568      0             0        0       0       0              0  
159569      0             0        0       0       0              0  
159570      0             0        0       0       0              0  

[159571 rows x 8 columns]

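Then I use train_test_split to split the data into training and test sets:

from sklearn.model_selection import train_test_split

# Features: only the comment text. Targets: the six label columns.
X = trainData.drop(labels=['id','toxic','severe_toxic','obscene','threat','insult','identity_hate'], axis=1)
Y = trainData.drop(labels=['id','comment_text'], axis=1)

trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.9999, random_state=99)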
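
For reference, the split sizes implied by test_size=0.9999 can be worked out by hand. The sketch below assumes that a fractional test_size is rounded up to a whole number of rows and that the remainder goes to the training set; under that assumption it reproduces the 15 training rows and 159,556 test rows quoted in the question.

```python
import math

n_samples = 159571      # rows in train.csv, per the printed DataFrame
test_size = 0.9999      # fraction passed to train_test_split

# Assumption: the test row count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```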

I am using sklearn's HashingVectorizer to turn the comments into numeric vectors for classification:

from sklearn.feature_extraction.text import HashingVectorizer

def hashVec():
    # Pull the raw comment strings out of the split DataFrames
    trainComments = trainX['comment_text'].tolist()
    testComments = testX['comment_text'].tolist()
    # HashingVectorizer is stateless, so transform() works without a prior fit()
    vectorizer = HashingVectorizer()
    trainSamples = vectorizer.transform(trainComments)
    testSamples = vectorizer.transform(testComments)
    return trainSamples, testSamples

I use OneVsRestClassifier and LogisticRegression from sklearn to fit and predict each of the 6 categories:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

def logRegOVR(trainSamples, testSamples):
    commentTypes = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
    clf = OneVsRestClassifier(LogisticRegression(solver='sag'))
    # Fit one binary model per label and report train and test accuracy
    for cType in commentTypes:
        print(cType, ":")
        clf.fit(trainSamples, trainY[cType])
        pred1 = clf.predict(trainSamples)
        print("\tTrain Accuracy:", accuracy_score(trainY[cType], pred1))
        prediction = clf.predict(testSamples)
        print("\tTest Accuracy:", accuracy_score(testY[cType], prediction))

Finally, this is where I call the functions, and the output I get:

sol = hashVec()
logRegOVR(sol[0],sol[1])

toxic :
    Train Accuracy: 0.8666666666666667
    Test Accuracy: 0.9041590413397177
severe_toxic :
    Train Accuracy: 1.0
    Test Accuracy: 0.9900035097395272
obscene :
    Train Accuracy: 1.0
    Test Accuracy: 0.9470468048835519
threat :
    Train Accuracy: 1.0
    Test Accuracy: 0.9970041866178646
insult :
    Train Accuracy: 1.0
    Test Accuracy: 0.9506317531148938
identity_hate :
    Train Accuracy: 1.0
    Test Accuracy: 0.9911943142219659

When I use a more reasonable train_test_split (80% training and 20% test), the test accuracy is very similar.

Thanks for your help.
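
One way to read the numbers above: if a classifier predicted 0 for every comment, its accuracy would equal the share of comments labeled 0. The sketch below takes the test accuracies reported in the question and computes the share of positive labels each one would imply under that assumption. Whether the model really predicts 0 everywhere is not established here; it is only a hypothesis to check.

```python
# Test accuracies copied from the output in the question
reported_test_accuracy = {
    "toxic": 0.9041590413397177,
    "severe_toxic": 0.9900035097395272,
    "obscene": 0.9470468048835519,
    "threat": 0.9970041866178646,
    "insult": 0.9506317531148938,
    "identity_hate": 0.9911943142219659,
}

# Assumption: the model predicts 0 for every comment, so
# accuracy = share of zeros, and 1 - accuracy = share of ones.
implied_positive_rate = {
    label: 1.0 - acc for label, acc in reported_test_accuracy.items()
}

for label, rate in implied_positive_rate.items():
    print(f"{label}: {rate:.4f}")
```

Comparing each implied rate with the actual share of positives, for example testY[cType].mean() for each label, would show whether the reported accuracy is simply the majority-class rate.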

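Best answer

You are not using a good metric: accuracy is not a good way to tell whether your classifier is working. Most comments in this dataset carry a 0 for every label, so a model that simply predicts 0 for everything already scores very high accuracy, however little it has learned. I suggest you look at the F1 score, which combines precision and recall (it is their harmonic mean); I find it much more relevant for evaluating how my classifiers perform.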
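
To illustrate the difference between accuracy and F1, here is a minimal sketch in plain Python. The labels are made up (10 positives among 100 comments, not the Kaggle data), the helper precision_recall_f1 is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention for this sketch: an undefined ratio counts as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical, imbalanced labels: 10 positives among 100 comments
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # a model that never flags any comment

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print("accuracy:", accuracy)
print("f1:", f1)
```

The model that never flags anything reaches 90% accuracy but an F1 of 0, because it finds none of the positive comments. With scikit-learn already in use, sklearn.metrics.f1_score(testY[cType], prediction) could be printed alongside accuracy_score inside the loop of logRegOVR.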

Regarding "python - Test accuracy is always high no matter how small my training set is", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59131043/
