
python - Transformers: Asking to pad but the tokenizer does not have a padding token

Reposted. Author: 行者123. Updated: 2023-12-02 01:49:01

I am trying to evaluate a set of transformers models one after another on the same dataset, to check which one performs best.

This is the list of models:

MODELS = [
    ('xlm-mlm-enfr-1024', "XLMModel"),
    ('distilbert-base-cased', "DistilBertModel"),
    ('bert-base-uncased', "BertModel"),
    ('roberta-base', "RobertaModel"),
    ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
    ('xlnet-base-cased', "XLNetModel"),
    #('ctrl', "CTRLModel"),
    ('transfo-xl-wt103', "TransfoXLModel"),
    ('bert-base-cased', "BertModelUncased"),
    ('xlm-roberta-base', "XLMRobertaModel"),
    ('openai-gpt', "OpenAIGPTModel"),
    ('gpt2', "GPT2Model"),
]

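All of them work fine until the 'ctrl' model, which returns this error:

Asking to pad but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.

It is raised while tokenizing the sentences of my dataset.

The tokenizing code is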

import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# df (a DataFrame with 'tweet' and 'label' columns) and MAX_LEN are defined elsewhere
SEQ_LEN = MAX_LEN #(50)

for pretrained_weights, model_name in MODELS:

    print("***************** INICIANDO " ,model_name,", weights ",pretrained_weights, "********* ")
    print("carganzo el tokenizador ()")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creando el modelo preentrenado")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("aplicando el tokenizador al dataset")

    ## APLICAMOS EL TOKENIZADOR (apply the tokenizer) ##

    def tokenize(sentence):

        tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                       truncation=True, padding='max_length',
                                       add_special_tokens=True, return_attention_mask=True,
                                       return_token_type_ids=False, return_tensors='tf')
        return tokens['input_ids'], tokens['attention_mask']

    # initialize two arrays for input tensors
    Xids = np.zeros((len(df), SEQ_LEN))
    Xmask = np.zeros((len(df), SEQ_LEN))

    for i, sentence in enumerate(df['tweet']):
        Xids[i, :], Xmask[i, :] = tokenize(sentence)
        if i % 10000 == 0:
            print(i) # do this so we can see some progress


    arr = df['label'].values # take label column in df as array

    labels = np.zeros((arr.size, arr.max()+1)) # initialize empty (all zero) label array
    labels[np.arange(arr.size), arr] = 1 # add ones in indices where we have a value
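The last three lines of the code build one-hot labels. As a quick sanity check, the same indexing can be run on a toy array. The label values below are made up, and the labels are assumed to be non-negative integers starting at 0, which is what arr.max()+1 relies on.

```python
import numpy as np

arr = np.array([0, 2, 1, 2])  # made-up integer class labels

# One row per sample, one column per class (classes 0..arr.max()).
labels = np.zeros((arr.size, arr.max() + 1))

# For each row i, set the column given by arr[i] to 1.
labels[np.arange(arr.size), arr] = 1

print(labels)
```

Each row ends up with exactly one 1, in the column of that sample's class.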

I tried defining the padding token as the error message suggests, but then I get this error

could not broadcast input array from shape (3,) into shape (50,)

on the line

Xids[i, :], Xmask[i, :] = tokenize(sentence)
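The broadcast error can be reproduced with NumPy alone, which may help narrow down the cause. The sketch below assumes (the question does not confirm this) that the tokenizer returned an unpadded sequence of 3 ids, for example because padding was no longer applied after the pad token was changed. pad_to_length and pad_id=0 are a hypothetical helper and value used only for illustration, and the token ids are made up.

```python
import numpy as np

SEQ_LEN = 50  # same length as in the question


def pad_to_length(ids, length, pad_id=0):
    """Truncate or right-pad a sequence of token ids to exactly `length`.

    Hypothetical helper for illustration; pad_id=0 is an assumed value.
    """
    ids = np.asarray(ids).reshape(-1)[:length]  # flatten (1, n) -> (n,), then truncate
    padded = np.full(length, pad_id, dtype=ids.dtype)
    padded[:ids.size] = ids
    return padded


Xids = np.zeros((2, SEQ_LEN))
short_ids = np.array([101, 7592, 102])  # made-up ids standing in for an unpadded encoding

try:
    Xids[0, :] = short_ids  # shape (3,) cannot fill a row of shape (50,)
except ValueError as err:
    print("ValueError:", err)

Xids[0, :] = pad_to_length(short_ids, SEQ_LEN)  # shapes now match
print(Xids[0, :5])
```

If the encoded row really is shorter than SEQ_LEN, the cause is in how padding is requested from the tokenizer rather than in the NumPy assignment itself.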

I also tried this solution, and neither approach works.

If you have read this far, thank you.

Any help is appreciated.

Best answer

You can add a [PAD] token using the add_special_tokens API.

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
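A slightly more defensive variant of this answer is sketched below. It assumes the tokenizer and model follow the Hugging Face transformers interface (pad_token, eos_token, add_special_tokens, resize_token_embeddings); ensure_pad_token is a hypothetical helper name, not part of the library. Reusing the end-of-sequence token leaves the vocabulary size unchanged, whereas adding a new [PAD] token grows the vocabulary, so the model's embedding matrix should be resized to match.

```python
def ensure_pad_token(tokenizer, model=None, pad_token="[PAD]"):
    """Make sure `tokenizer` has a padding token (hypothetical helper).

    Returns "existing", "eos" or "added" to report what was done.
    """
    if tokenizer.pad_token is not None:
        return "existing"

    if tokenizer.eos_token is not None:
        # Reuse the end-of-sequence token, as the error message suggests.
        # The vocabulary size does not change.
        tokenizer.pad_token = tokenizer.eos_token
        return "eos"

    # No EOS token either: register a brand-new padding token.
    tokenizer.add_special_tokens({"pad_token": pad_token})
    if model is not None:
        # The vocabulary grew by one token, so the embeddings must grow too.
        model.resize_token_embeddings(len(tokenizer))
    return "added"
```

Inside the loop from the question, this would be called as ensure_pad_token(tokenizer, transformer_model) right after both objects are loaded and before tokenize is used.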

Regarding python - Transformers: Asking to pad but the tokenizer does not have a padding token, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70544129/
