huggingface-transformers - Adding new tokens to BERT/RoBERTa while retaining tokenization of adjacent tokens

Reposted by: 行者123 Updated: 2023-12-05 03:33:55

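I'm trying to add some new tokens to the BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences containing the new word, and then see what they predict about the word in other, different contexts, in order to probe the state of the model's knowledge of certain properties of language.

To do this, I'd like to add the new tokens and treat them essentially as new ordinary words that the model just hasn't happened to encounter yet. Once added, they should behave exactly like ordinary words, except that their rows of the embedding matrix will be randomly initialized and then learned during fine-tuning.

However, I've run into some problems doing this. In particular, for BERT, when the tokenizer is initialized with do_basic_tokenize=False, the tokens surrounding the newly added token do not behave as expected (for RoBERTa, changing this setting doesn't seem to affect the output in the examples here). The problem can be observed in the following example: for BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of the expected ##.), and for RoBERTa, the word following the newly added token is treated as though it has no preceding space (i.e., it is tokenized as a instead of Ġa).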

from transformers import BertTokenizer, RobertaTokenizer

new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']

bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']

roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']

roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']

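Is there a way to add the new token while keeping the behavior of the surrounding tokens the same as it would be without the added token? I feel this matters because the model could end up learning that (for instance) the new token can occur before ., while most other tokens can only occur before ##., which seems likely to affect how it generalizes. In addition, I could turn on basic tokenization to solve the BERT problem here, but that wouldn't really reflect the full state of the model's knowledge, since it collapses the distinction between the different tokens. And it doesn't help with the RoBERTa problem, which remains regardless.

In addition, ideally I'd be able to add the RoBERTa token as Ġmynewword, but I'm assuming that as long as it never occurs as the first word in a sentence, that shouldn't matter.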
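
For readers unfamiliar with the Ġ marker: RoBERTa uses GPT-2 style byte-level BPE, in which every byte is first mapped to a printable character, and the space byte ends up as Ġ. As far as I understand, the pre-tokenizer attaches a leading space to the word that follows it, which is why a word in the middle of a sentence shows up as Ġa. The following is a minimal sketch of that byte mapping written from memory of the GPT-2 encoder; the helper name to_bpe_alphabet is made up for illustration and is not part of transformers.

```python
def bytes_to_unicode():
    """Map each byte value 0..255 to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~', 0xA1..0xAC and 0xAE..0xFF.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (control characters, space, ...) is shifted to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_map = bytes_to_unicode()
space_marker = byte_map[ord(" ")]  # the character that stands for a space


def to_bpe_alphabet(text):
    """Hypothetical helper: rewrite text in the alphabet the BPE merges operate on."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))


print(space_marker == chr(288))        # True
print(to_bpe_alphabet(" mynewword"))   # Ġmynewword
```

Under this mapping, a vocabulary entry for the word with a preceding space is literally the string chr(288) + 'mynewword'.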

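Best answer

After continuing to work on this, I seem to have found something that might work. It's not necessarily generalizable, but a tokenizer can be loaded from a vocabulary file (plus a merges file for RoBERTa). If you manually edit these files to add the new tokens in the right way, everything seems to work as expected. Here's an example for BERT: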

from transformers import BertTokenizer

bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert.tokenize('testing.') # ['testing', '##.']
bert.tokenize('mynewword') # ['my', '##ne', '##w', '##word']

bert_vocab = bert.get_vocab() # get the pretrained tokenizer's vocabulary
bert_vocab.update({'mynewword' : len(bert_vocab)}) # add the new word to the end

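with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    # write one token per line in id order, since the line number determines the token id
    tmp_vocab_file.write('\n'.join(sorted(bert_vocab, key=bert_vocab.get)))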
    
new_bert = BertTokenizer(name_or_path = 'bert-base-uncased', vocab_file = 'vocab.tmp', do_basic_tokenize=False)
new_bert.model_max_length = 512 # for identity to this setting on the pretrained one

new_bert.tokenize('mynewword') # ['mynewword']
new_bert.tokenize('mynewword.') # ['mynewword', '##.']

import os
os.remove('vocab.tmp') # cleanup
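
To make it clearer why editing the vocabulary file behaves differently from add_tokens, here is a small dependency-free sketch of greedy longest-match WordPiece over a toy vocabulary. It rests on two assumptions: that with do_basic_tokenize=False BERT applies WordPiece to each whitespace-delimited chunk as a whole, and that add_tokens splits the text around the added token before WordPiece ever runs. The function names and the toy vocabulary are made up for illustration; this is not the library's code.

```python
import re


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece over one whitespace-delimited chunk."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_split_first(text, added, vocab):
    """Assumed add_tokens-like behaviour: cut out the added token, then tokenize the rest."""
    pieces = []
    for chunk in re.split(f"({re.escape(added)})", text):
        if chunk == added:
            pieces.append(added)
        elif chunk:
            pieces.extend(wordpiece(chunk, vocab))
    return pieces


toy_vocab = {"my", "##ne", "##w", "##word", "testing", ".", "##."}

before = wordpiece("mynewword.", toy_vocab)
in_vocab = wordpiece("mynewword.", toy_vocab | {"mynewword"})
split_first = tokenize_split_first("mynewword.", "mynewword", toy_vocab)

print(before)       # ['my', '##ne', '##w', '##word', '##.']
print(in_vocab)     # ['mynewword', '##.']
print(split_first)  # ['mynewword', '.']
```

When the new word lives in the vocabulary itself, the period is still seen as a continuation of the same chunk and becomes ##., whereas splitting first leaves the period as a chunk of its own.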

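RoBERTa is harder, since we also have to add the pairs to merges.txt. I have a way of doing this that works for the new tokens, but unfortunately it can affect the tokenization of words that are subparts of the new tokens, so it's not perfect. If you're using this to add made-up words (as in my use case), you can just choose strings that are unlikely to cause problems (unlike the example here, 'mynewword'), but in other cases it is likely to cause issues. (While this isn't a perfect solution, hopefully it might help others find a better one.)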

import re
import json
import requests
from transformers import RobertaTokenizer

roberta = RobertaTokenizer.from_pretrained('roberta-base')
roberta.tokenize('testing a') # ['testing', 'Ġa']
roberta.tokenize('mynewword') # ['my', 'new', 'word']

# update the vocabulary with the new token and the 'Ġ' version
roberta_vocab = roberta.get_vocab()
roberta_vocab.update({'mynewword' : len(roberta_vocab)}) 
roberta_vocab.update({chr(288) + 'mynewword' : len(roberta_vocab)}) # chr(288) = 'Ġ'
with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    json.dump(roberta_vocab, tmp_vocab_file, ensure_ascii=False)

# get and modify the merges file so that the new token will always be tokenized as a single word
url = 'https://huggingface.co/roberta-base/resolve/main/merges.txt'
roberta_merges = requests.get(url).content.decode().split('\n')

# this is a helper function to loop through a list of new tokens and get the byte-pair encodings
# such that the new token will be treated as a single unit always
def get_roberta_merges_for_new_tokens(new_tokens):
    merges = [gen_roberta_pairs(new_token) for new_token in new_tokens]
    merges = [pair for token in merges for pair in token]
    return merges

def gen_roberta_pairs(new_token, highest = True):
    # highest is used to determine whether we are dealing with the Ġ version or not. 
    # we add those pairs at the end, which is only if highest = True
    
    # this is the hard part...
    chrs = [c for c in new_token] # list of characters in the new token, which we will recursively iterate through to find the BPEs
    
    # the simplest case: add one pair
    if len(chrs) == 2:
        if not highest: 
            return tuple([chrs[0], chrs[1]])
        else:
            return [' '.join([chrs[0], chrs[1]])]
    
    # add the tokenization of the first letter plus the other two letters as an already merged pair
    if len(chrs) == 3:
        if not highest:
            return tuple([chrs[0], ''.join(chrs[1:])])
        else:
            return gen_roberta_pairs(chrs[1:]) + [' '.join([chrs[0], ''.join(chrs[1:])])]
    
    if len(chrs) % 2 == 0:
        pairs = gen_roberta_pairs(''.join(chrs[:-2]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += tuple([''.join(chrs[:-2]), ''.join(chrs[-2:])])
        if not highest:
            return pairs
    else:
        # for new tokens with odd numbers of characters, we need to add the final two tokens before the
        # third-to-last token
        pairs = gen_roberta_pairs(''.join(chrs[:-3]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-3:]), highest = False)
        pairs += tuple([''.join(chrs[:-3]), ''.join(chrs[-3:])])
        if not highest:
            return pairs
    
    pairs = tuple(zip(pairs[::2], pairs[1::2]))
    pairs = [' '.join(pair) for pair in pairs]
    
    # pairs with the preceding special token
    g_pairs = []
    for pair in pairs:
        if re.search(r'^' + ''.join(pair.split(' ')), new_token):
            g_pairs.append(chr(288) + pair)
    
    pairs = g_pairs + pairs
    pairs = [chr(288) + ' ' + new_token[0]] + pairs
    
    pairs = list(dict.fromkeys(pairs)) # remove any duplicates
    
    return pairs

# first line of this file is a comment; add the new pairs after it
roberta_merges = roberta_merges[:1] + get_roberta_merges_for_new_tokens(['mynewword']) + roberta_merges[1:]
roberta_merges = list(dict.fromkeys(roberta_merges))
with open('merges.tmp', 'w', encoding = 'utf-8') as tmp_merges_file:
    tmp_merges_file.write('\n'.join(roberta_merges))

new_roberta = RobertaTokenizer(name_or_path='roberta-base', vocab_file='vocab.tmp', merges_file='merges.tmp')

# for some reason, we have to re-add the <mask> token to roberta if we are using it, since
# loading the tokenizer from a file will cause it to be tokenized as separate parts
# the weight matrix is identical, and once re-added, a fill-mask pipeline still identifies
# the mask token correctly (not shown here)
new_roberta.add_tokens(new_roberta.mask_token, special_tokens=True)
new_roberta.model_max_length = 512

new_roberta.tokenize('mynewword') # ['mynewword']
new_roberta.tokenize('mynewword a') # ['mynewword', 'Ġa']
new_roberta.tokenize(' mynewword') # ['Ġmynewword']

# however, this does not guarantee that tokenization of other words will not be affected
roberta.tokenize('mynew') # ['my', 'new']
new_roberta.tokenize('mynew') # ['myne', 'w']

import os
os.remove('vocab.tmp')
os.remove('merges.tmp') # cleanup
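
The side effect on 'mynew' follows from how BPE applies merges: at each step the adjacent pair with the best (earliest) rank in the merges list is merged, so rules placed at the top of the file win over the pretrained ones. Below is a toy sketch of that procedure. The merge lists are invented for illustration; they are neither the real roberta-base merges nor the exact output of gen_roberta_pairs.

```python
def bpe(word, merges):
    """Apply ranked BPE merges to a single word; earlier merges have priority."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical stand-in for the pretrained merges.
base = [("m", "y"), ("n", "e"), ("ne", "w")]
# Hypothetical merges that build 'mynewword' as a single unit.
new = [("m", "y"), ("n", "e"), ("my", "ne"), ("w", "o"), ("r", "d"),
       ("wo", "rd"), ("w", "word"), ("myne", "wword")]

prepended = list(dict.fromkeys(new + base))  # new rules first, duplicates dropped
appended = list(dict.fromkeys(base + new))   # pretrained rules first

print(bpe("mynew", base))           # ['my', 'new']
print(bpe("mynewword", prepended))  # ['mynewword']
print(bpe("mynew", prepended))      # ['myne', 'w']
print(bpe("mynewword", appended))   # ['my', 'new', 'word']
```

In this toy setup, appending the new rules instead of prepending them leaves 'mynew' untouched, but the new word no longer collapses into one token, because the pretrained merge of ne and w fires first. That is the trade-off behind inserting the pairs right after the header line.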

Regarding huggingface-transformers - Adding new tokens to BERT/RoBERTa while retaining tokenization of adjacent tokens, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70255025/
