I want to use BertForMaskedLM or BertModel to compute the perplexity of a sentence, so I wrote code like this:
import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')

    sentence = "我不会忘记和你一起奋斗的时光。"
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.

    for i, word in enumerate(tokenize_input):
        # add mask to i-th character of the sentence
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)
        prediction_scores = output[0]

        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()

        tokenize_input[i] = word

    ppl = np.exp(-sentence_loss / sen_len)
    print(ppl)
I think this code is correct, but I also noticed that BertForMaskedLM takes a masked_lm_labels argument. Could I use that argument to compute a sentence's PPL more simply?
if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss()  # -100 index = padding token
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                              masked_lm_labels.view(-1))
    outputs = (masked_lm_loss,) + outputs
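For context, the -100 in that snippet is CrossEntropyLoss's default ignore_index: any label set to -100 simply does not contribute to the loss. A small standalone sketch of that behavior (toy logits, not real model output):

import torch
import torch.nn as nn

# Toy example: 3 positions over a 5-word vocabulary. Only position 1 has a
# real target; the other two are -100 and are skipped by the loss.
logits = torch.randn(3, 5)
labels = torch.tensor([-100, 2, -100])
loss_fct = nn.CrossEntropyLoss()   # ignore_index defaults to -100
print(loss_fct(logits, labels))    # cross-entropy at position 1 only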
Best Answer
Yes, you can use the labels argument (or masked_lm_labels; I think the parameter name varies with the version of the Huggingface transformers library) to supply the target tokens at the masked positions, and use -100 for the tokens you don't want included in the loss computation.
For example,
sentence = '我爱你'

from transformers import BertTokenizer, BertForMaskedLM
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')

tensor_input = tokenizer.encode(sentence, return_tensors='pt')
# tensor([[ 101, 2769, 4263,  872,  102]])

# one copy of the sentence per real token ([CLS] and [SEP] excluded)
repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
# tensor([[ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102],
#         [ 101, 2769, 4263,  872,  102]])

# diagonal mask selecting the i-th real token in the i-th copy
mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
# tensor([[0., 1., 0., 0., 0.],
#         [0., 0., 1., 0., 0.],
#         [0., 0., 0., 1., 0.]])

# replace the selected tokens with [MASK] (id 103 in this vocabulary)
masked_input = repeat_input.masked_fill(mask == 1, 103)
# tensor([[ 101,  103, 4263,  872,  102],
#         [ 101, 2769,  103,  872,  102],
#         [ 101, 2769, 4263,  103,  102]])

# keep targets only at the masked positions; -100 is ignored by the loss
labels = repeat_input.masked_fill(masked_input != 103, -100)
# tensor([[-100, 2769, -100, -100, -100],
#         [-100, -100, 4263, -100, -100],
#         [-100, -100, -100,  872, -100]])

loss, _ = model(masked_input, masked_lm_labels=labels)
score = np.exp(loss.item())
As a function (the returned loss is already the mean cross-entropy over the masked positions, so exp(loss) gives the sentence score directly):
def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    return np.exp(loss.item())

score(model, tokenizer, '我爱你')  # returns 45.63794545581973
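The snippet above relies on the older transformers call convention, where the model returns a tuple and the label argument is named masked_lm_labels. Below is a minimal sketch of the same scorer for a newer release, under the assumption that the argument is called labels and the model returns an output object with a .loss attribute; check the names against your installed version.

import numpy as np
import torch

def score_v2(model, tokenizer, sentence):
    # Same masking scheme as score() above, but with the assumed newer API names.
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    # use the tokenizer's own [MASK] id instead of hard-coding 103
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.no_grad():
        loss = model(masked_input, labels=labels).loss  # assumed newer API
    return np.exp(loss.item())

Using tokenizer.mask_token_id keeps the scorer independent of any particular vocabulary file.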
Regarding "nlp - How to compute the perplexity of a sentence using BertForMaskedLM or BertModel?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63030692/