
nlp - BERT model: "enable_padding() got an unexpected keyword argument 'max_length'"

Reposted. Author: 行者123. Updated: 2023-12-05 06:03:09

I am trying to implement a BERT model architecture using Hugging Face and Keras. I am following a Kaggle notebook (link) and trying to understand it. When I tokenize my data, I run into a problem and get an error message. The error message is:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-888a40c0160b> in <module>
----> 1 x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
2 x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
3 x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN )
4 y_train = train1.toxic.values
5 y_valid = valid.toxic.values

<ipython-input-8-de591bf0a0b9> in fast_encode(texts, tokenizer, chunk_size, maxlen)
4 """
5 tokenizer.enable_truncation(max_length=maxlen)
----> 6 tokenizer.enable_padding(max_length=maxlen)
7 all_ids = []
8

TypeError: enable_padding() got an unexpected keyword argument 'max_length'

The code is:

x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=192)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=192 )
y_train = train1.toxic.values
y_valid = valid.toxic.values

The fast_encode function is:

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encoder for encoding the text into sequence of integers for BERT Input
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)
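Setting the tokenizer error aside, the loop itself just encodes the corpus in chunks and stacks the resulting id lists into one array. Here is a minimal sketch of that same chunk-encode-collect pattern using a toy stand-in tokenizer (ToyTokenizer is a hypothetical class invented here for illustration, not part of any library):

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical stand-in: maps characters to ids, pads/truncates to a fixed length."""
    def __init__(self, maxlen):
        self.maxlen = maxlen

    def encode_batch(self, texts):
        out = []
        for t in texts:
            ids = [ord(c) for c in t][:self.maxlen]   # truncate to maxlen
            ids += [0] * (self.maxlen - len(ids))     # pad with id 0
            out.append(ids)
        return out

def fast_encode_sketch(texts, tokenizer, chunk_size=2):
    all_ids = []
    # Walk the corpus in chunks, exactly like the original loop
    for i in range(0, len(texts), chunk_size):
        all_ids.extend(tokenizer.encode_batch(texts[i:i + chunk_size]))
    return np.array(all_ids)

ids = fast_encode_sketch(["ab", "a", "abcd", "abc", "x"], ToyTokenizer(maxlen=3))
print(ids.shape)  # (5, 3): every row padded/truncated to maxlen
```

Because every row comes out the same length, np.array produces a clean 2-D matrix suitable as model input.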

What should I do now?

Best Answer

The tokenizer used here is not the regular tokenizer but a fast tokenizer provided by an older version of Hugging Face's tokenizers library.
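Concretely, the TypeError arises because newer versions of the tokenizers library renamed the enable_padding() keyword from max_length to length, so the old call no longer matches the signature. A toy stand-in (not the real library) reproducing the same failure mode:

```python
# Hypothetical function mimicking the newer enable_padding() signature,
# where the keyword is `length`, not `max_length`:
def enable_padding(direction="right", pad_id=0, pad_token="[PAD]", length=None):
    return {"length": length}

# Old-style call: raises the same kind of TypeError as in the traceback
try:
    enable_padding(max_length=192)
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'max_length'

# New-style call: accepted
print(enable_padding(length=192))  # {'length': 192}
```

So if you want to keep the fast_encode function as-is on a recent tokenizers version, changing tokenizer.enable_padding(max_length=maxlen) to tokenizer.enable_padding(length=maxlen) should resolve the error.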

If you want to create a fast tokenizer with the older version of huggingface transformers used in the notebook, you can do it like this:

import transformers
from tokenizers import BertWordPieceTokenizer

# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally (this writes vocab.txt to the current directory)
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer

However, since I wrote this code, working with fast tokenizers has become much simpler. If you look at the Preprocessing data tutorial by Huggingface, you will see that all you need to do is:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

batch_sentences = [
    "Hello world",
    "Some slightly longer sentence to trigger padding"
]
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")

This works because the fast tokenizer (written in Rust) is used automatically whenever it is available.
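For intuition: padding=True pads each sequence to the longest one in the batch (dynamic padding), while truncation=True caps sequences at a maximum length. A pure-Python sketch of that behaviour on toy id lists (not a real tokenizer):

```python
def pad_and_truncate(batch_ids, model_max_length=512, pad_id=0):
    """Mimic padding=True / truncation=True: truncate each sequence to the
    model max, then pad every sequence to the longest one left in the batch."""
    truncated = [ids[:model_max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]

batch = pad_and_truncate([[5, 6], [7, 8, 9, 10]], model_max_length=3)
print(batch)  # [[5, 6, 0], [7, 8, 9]]
```

Dynamic padding keeps batches as short as possible, which is usually faster than always padding to a fixed maxlen as the original fast_encode does.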

Regarding nlp - BERT model: "enable_padding() got an unexpected keyword argument 'max_length'", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66743649/
