
bert-language-model - How to add new special tokens to the tokenizer?

Reposted · Author: 行者123 · Updated: 2023-12-05 01:27:46

I want to build a multi-class classification model that takes conversation data as input to a BERT model (using bert-base-uncased).

QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?

Apart from this, I have two additional inputs.

I am wondering whether I should add special tokens to the conversation to make it more meaningful to the BERT model, for example:

[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]

However, I was unable to add [EOT] as a new special token.
Or should I use the [SEP] token for this instead?

Edit: steps to reproduce

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids) # --> [100, 102, 0, 101, 103]

num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer)) # --> Embedding(30523, 768)

tokenizer.convert_tokens_to_ids('[EOT]') # --> 30522

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
text_to_encode,
max_length=128,
add_special_tokens=True,
return_token_type_ids=False,
return_attention_mask=False,
)['input_ids']

print(tokenizer.convert_ids_to_tokens(enc))

Result:

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it', 'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query', ':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']', 'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything', 'else', '?', '[SEP]']

Accepted answer

The intended role of the [SEP] token is to act as a separator between two sentences, so it fits your goal of using [SEP] to separate the QUERY and ANSWER sequences.
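A minimal sketch of the built-in [SEP] behavior, assuming transformers is installed and bert-base-uncased can be downloaded: passing a text pair to encode_plus makes the tokenizer insert [SEP] between the two segments without adding any new tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A text pair is encoded as [CLS] segment_a [SEP] segment_b [SEP]
enc = tokenizer.encode_plus("How is the weather today?", "It is nice and sunny.")
toks = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(toks)
# ['[CLS]', 'how', 'is', 'the', 'weather', 'today', '?', '[SEP]',
#  'it', 'is', 'nice', 'and', 'sunny', '.', '[SEP]']
```

This also sets token_type_ids so the model can tell the two segments apart.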

You could also add distinct tokens to mark the boundaries of each turn, e.g. <BOQ> and <EOQ> to mark the beginning and end of a QUERY, and likewise <BOA> and <EOA> to mark the beginning and end of an ANSWER.
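A hedged sketch of that variant (the <BOQ>/<EOQ>/<BOA>/<EOA> names are just the illustrative markers from the answer, not standard BERT tokens): registering them via add_special_tokens keeps them from being lower-cased or split into word pieces.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Register the four markers as additional special tokens.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<BOQ>", "<EOQ>", "<BOA>", "<EOA>"]}
)

# If a model is attached, its embedding matrix must be resized afterwards:
# model.resize_token_embeddings(len(tokenizer))

toks = tokenizer.tokenize(
    "<BOQ> How is the weather today? <EOQ> <BOA> It is nice and sunny. <EOA>"
)
print(toks)
# ['<BOQ>', 'how', 'is', 'the', 'weather', 'today', '?', '<EOQ>',
#  '<BOA>', 'it', 'is', 'nice', 'and', 'sunny', '.', '<EOA>']
```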

Sometimes, using existing tokens works better than adding new ones to the vocabulary, since learning embeddings for new tokens requires many training iterations and a lot of data.

However, if your application requires new tokens, they can be added as follows:

num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True)  # note special_tokens=True
model.resize_token_embeddings(len(tokenizer))

# The tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
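A quick self-contained check of this fix, assuming transformers is installed: with special_tokens=True the new token is protected from lower-casing and word-piece splitting, so it survives tokenization intact instead of decomposing into '[', 'e', '##ot', ']' as in the question.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# special_tokens=True marks [EOT] as a special token, so the
# tokenizer will never lower-case or split it.
tokenizer.add_tokens(['[EOT]'], special_tokens=True)

tokens = tokenizer.tokenize("QUERY: How is the weather today? [EOT]")
print(tokens)
# ['query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[EOT]']
```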

Regarding "bert-language-model - How to add new special tokens to the tokenizer?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69191305/
