pytorch - Huggingface bert 表现出较差的准确性/f1 分数 [pytorch]-6ren

pytorch - Huggingface bert 表现出较差的准确性/f1 分数 [pytorch]

转载作者：行者123 更新时间：2023-12-03 23:47:11

我正在尝试 BertForSequenceClassification用于简单的文章分类任务。

无论我如何训练(卡住除分类层之外的所有层、所有可训练层、最后 k 层可训练)，我总是得到几乎随机的准确度分数。我的模型没有超过 24-26% 的训练准确度(我的数据集中只有 5 个类)。

我不确定在设计/训练模型时我做错了什么。我尝试了多个数据集的模型，每次它都给出相同的随机基线精度。

我使用的数据集:BBC 文章(5 类)

https://github.com/zabir-nabil/pytorch-nlp/tree/master/bbc

Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Natural Classes: 5 (business, entertainment, politics, sport, tech)

我添加了模型部分和训练部分，这是最重要的部分(以避免任何不相关的细节)。如果这对重现性有用，我也添加了完整的源代码 + 数据。

我的猜测是我设计网络的方式或我将 attention_masks/标签传递给模型的方式有问题。此外， token 长度 512 应该不是问题，因为大多数文本的长度 < 512(平均长度 < 300)。

型号代码:

import torch
from torch import nn

class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 5)
        # as we have 5 classes

        # we want our output as probability so, in the evaluation mode, we'll pass the logits to a softmax layer
        self.softmax = torch.nn.Softmax(dim = 1) # last dimension
    def forward(self, x, attn_mask = None, labels = None):

        if self.training == True:
            # print(x.shape)
            loss = self.bert(x, attention_mask = attn_mask, labels = labels)
            # print(x[0].shape)

            return loss

        if self.training == False: # in evaluation mode
            x = self.bert(x)
            x = self.softmax(x[0])

            return x
    def freeze_layers(self, last_trainable = 1): 
        # we freeze all the layers except the last classification layer + few transformer blocks
        for layer in list(self.bert.parameters())[:-last_trainable]:
            layer.requires_grad = False


# create our model

bertclassifier = BertClassifier()

培训代码:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # cuda for gpu acceleration

# optimizer

optimizer = torch.optim.Adam(bertclassifier.parameters(), lr=0.001)


epochs = 15

bertclassifier.to(device) # taking the model to GPU if possible

# metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

train_losses = []

train_metrics = {'acc': [], 'f1': []}
test_metrics = {'acc': [], 'f1': []}

# progress bar

from tqdm import tqdm_notebook

for e in tqdm_notebook(range(epochs)):
    train_loss = 0.0
    train_acc = 0.0
    train_f1 = 0.0
    batch_cnt = 0

    bertclassifier.train()

    print(f'epoch: {e+1}')

    for i_batch, (X, X_mask, y) in tqdm_notebook(enumerate(bbc_dataloader_train)):
        X = X.to(device)
        X_mask = X_mask.to(device)
        y = y.to(device)


        optimizer.zero_grad()

        loss, y_pred = bertclassifier(X, X_mask, y)

        train_loss += loss.item()
        loss.backward()
        optimizer.step()

        y_pred = torch.argmax(y_pred, dim = -1)

        # update metrics
        train_acc += accuracy_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy())
        train_f1 += f1_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy(), average = 'micro')
        batch_cnt += 1

    print(f'train loss: {train_loss/batch_cnt}')
    train_losses.append(train_loss/batch_cnt)
    train_metrics['acc'].append(train_acc/batch_cnt)
    train_metrics['f1'].append(train_f1/batch_cnt)


    test_loss = 0.0
    test_acc = 0.0
    test_f1 = 0.0
    batch_cnt = 0

    bertclassifier.eval()
    with torch.no_grad():
        for i_batch, (X, y) in enumerate(bbc_dataloader_test):
            X = X.to(device)
            y = y.to(device)

            y_pred = bertclassifier(X) # in eval model we get the softmax output so, don't need to index


            y_pred = torch.argmax(y_pred, dim = -1)

            # update metrics
            test_acc += accuracy_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy())
            test_f1 += f1_score(y.cpu().detach().numpy(), y_pred.cpu().detach().numpy(), average = 'micro')
            batch_cnt += 1

    test_metrics['acc'].append(test_acc/batch_cnt)
    test_metrics['f1'].append(test_f1/batch_cnt)

数据集的完整源代码可在此处获得: https://github.com/zabir-nabil/pytorch-nlp/blob/master/bert-article-classification.ipynb

更新:

观察预测后，似乎模型几乎总是预测为 0:

bertclassifier.eval()
with torch.no_grad():
    for i_batch, (X, y) in enumerate(bbc_dataloader_test):
        X = X.to(device)
        y = y.to(device)

        y_pred = bertclassifier(X) # in eval model we get the softmax output so, don't need to index


        y_pred = torch.argmax(y_pred, dim = -1)

        print(y)
        print(y_pred)
        print('--------------------')

tensor([4, 2, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 0, 0, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 4, 4, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 3, 2, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 1, 4, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 0, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 2, 4, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 1, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 0, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 3, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 2, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 1, 2, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 4, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 3, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 2, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 4, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 4, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 1, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 2, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 1, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 4, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 2, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 4, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 1, 1, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 2, 4, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 2, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 1, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 2, 2, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 3, 2, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 1, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 4, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 2, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 0, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 0, 2, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 2, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 2, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 3, 0, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 2, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 4, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 0, 4, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 2, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 3, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 1, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 3, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 3, 0, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 2, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 0, 0, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 0, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 1, 1, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 1, 0, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 4, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 2, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 3, 4, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([3, 0, 4, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 1, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 4, 3, 1], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 0, 3, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 3, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 0, 3, 4], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 0, 1, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([1, 2, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([2, 0, 4, 2], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([4, 2, 4, 0], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
tensor([0, 0, 3, 3], device='cuda:0')
tensor([0, 0, 0, 0], device='cuda:0')
--------------------
...
...

实际上，模型总是预测相同的输出 [0.2270, 0.1855, 0.2131, 0.1877, 0.1867]对于任何输入，就好像它根本没有学到任何东西。

这很奇怪，因为我的数据集没有不平衡。

Counter({'politics': 417,
         'business': 510,
         'entertainment': 386,
         'tech': 401,
         'sport': 511})

最佳答案

经过一番挖掘，我发现，主要的罪魁祸首是学习率，用于微调 bert 0.001非常高。当我从 0.001 降低我的学习率时至 1e-5 ，我的训练和测试准确率都达到了 95%。

When BERT is fine-tuned, all layers are trained - this is quite different from fine-tuning in a lot of other ML models, but it matches what was described in the paper and works quite well (as long as you only fine-tune for a few epochs - it's very easy to overfit if you fine-tune the whole model for a long time on a small amount of data!)

源代码: https://github.com/huggingface/transformers/issues/587

Best result is found when all the layers are trained with a really small learning rate.

源代码: https://github.com/uzaymacar/comparatively-finetuning-bert

关于pytorch - Huggingface bert 表现出较差的准确性/f1 分数 [pytorch]，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/61969783/

文章推荐： scala - Scala/Chisel 中的函数循环

文章推荐： haskell - Haskell 中伴随仿函数的函数依赖

pthreads - sleep 表现
我正在用 C++ 开发一个程序，我必须实现一个 cron。由于不同的原因，这个 cron 应该每小时和每 24 小时执行一次。我的第一个想法是创建一个独立的 pthread 并在每次 1h 内休眠。这
javascript - 具有不同纹理的多个体素。表现
我需要向同一场景几何添加多个体素(立方体等于)，但每个体素具有不同的纹理。我的体素超过 500 个，导致性能出现严重错误。这是我的代码: texture = crearTextura(voxel.
mysql - 每个用户保存相似记录的单个表还是单独的表？ (表现？？)
对于 MySQL 数据库，我有 2 个场景，我不确定该选择哪一个，并且对于一些表我也遇到了同样的困境。我正在制作一个仅供成员(member)访问的网络应用程序。每个成员都有自己的交易、费用和“列表”
css - 我应该使用哪个？ (表现)
我想知道一个简单的事情: 当设置一个被所有 child 继承的样式时，是否建议最具体？ Structure: html > body > parent_content > wrapper > p 我想
c++ - 矩阵的乘法。表现
很难说出这里要问什么。这个问题模棱两可、含糊不清、不完整、过于宽泛或夸夸其谈，无法以目前的形式得到合理的回答。如需帮助澄清此问题以便重新打开，visit the help center . 关闭 1
java - JPA中的显式和隐式JOIN有什么区别？ (表现)
这些天我正在阅读有关 JPA 的内容。我了解到可以在 JPQL 中使用 explicit 或 implicit JOIN。显式加入 em.createQuery(“SELECT b.title, p
c# - 字符串连接与字符串生成器。表现
我有一种情况需要连接几个字符串以形成一个类的 id。基本上，我只是在列表中循环以获取对象的 ToString 值，然后将它们连接起来。 foreach (MyObject o in myList)
javascript - Canvas fillStyle 表现
我正在检查我的游戏在拖尾效果下的性能会降低多少。但我注意到每秒的操作次数更多了。这怎么可能？这是怎么回事... context.fillRect(0, 0, 500, 500); // cl
php - PHP 中的全局变量或传递变量？ (表现)
如果我可以选择使用全局变量或传递变量，哪个选项在速度和内存使用方面更好？ // global variable function func(){ global $var; echo $var;
mysql select 按主键排序。表现
我有一个类似这样的表“tbl”:ID bigint(20) - 主键，自增字段1字段2字段3 该表有 60 万多行。查询:SELECT * from tbl ORDER by ID LIMIT 60
algorithm - 旅行商 (TSP) 表现
谁能告诉我，我如何比较 TSP 最优和启发式算法？我已经实现了 TSP，但不知道如何比较它们。事实上，我怎样才能找到 TSP 的最优成本？有什么方法或猜测吗？谢谢最佳答案用众所周知的基准实例检查
ios - NSTextStorage 里面有长文本。表现
我有一个 NSTextStorage里面有长文本(比如一本书有 500 页，当前字体在设备上超过 9000 页)。我以这种方式为 textcontainer 分发此文本: let textStorag
c# - 按邮政编码查找产品 |半正弦算法 |表现
我有一个根据邮政编码搜索项目的应用程序。在搜索邮政编码时，我返回了来自该城市/社区的所有产品(通过解析邮政编码完成)。我现在需要根据与原始邮政编码的距离对这些产品进行分类。我将纬度/经度存储在数
performance - MPI Alltoallv或更好的个人Send and Recv？ (表现)
我有许多进程(大约100到1000个进程)，每个进程都必须向其他进程(例如大约10个)发送一些数据。 (通常，但不一定总是这样，如果A发送给B，B也发送给A。)每个进程都知道必须从哪个进程接收多少数据
performance - 带有 shouldComponentUpdate 的组件与无状态组件。表现？
我知道无状态组件使用起来更舒服(在特定场景下)，但是既然你不能使用shouldComponentUpdate，这是否意味着组件将在每次props更改时重新渲染？我的问题是，使用带有智能 shouldC
javascript - CSS/JS 即时缩小？ (表现)
我正在研究 Google Pagespeed 的加速页面加载时间指南列表。其中之一是缩小 CSS 和 JS 文件。由于这些文件经常更改，我正在考虑使用 PHP 脚本根据请求(来自浏览器)即时缩小此脚
MySQL 选择每个运动员的最佳(和最老)表现、类别
我正在尝试从下表构建 SQL 查询(示例): Example of table with name "performances" 这是带有运动表现的表格。我想从这个表中选择每个学科和一组一个或多个类别
c++ - 表现。寻找子串。 substr 与查找
假设我们有一个字符串 var "sA"，我想检查字符串 "123"是否在 sA 的末尾。什么更好，为什么: if(sA.length() > 2) sA.substr(sA.length()-3)
c# - Linq group by property 表现
关于受这篇文章启发的可参数化查询 LINQ group by property as a parameter我获得了一个很好的参数化查询，但在性能上有一个缺点。 public static void
c++ - 运算符(operator)表现|与运营商+
| 和| 之间有什么主要区别吗？和 + 从长远来看会影响代码的性能吗？或者都是 O(1)？我正在使用的代码是这样的: uint64_t dostuff(uint64_t a,uint64_t b){

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

pytorch - Huggingface bert 表现出较差的准确性/f1 分数 [pytorch]