
python - AutoTokenizer.from_pretrained fails to load a locally saved pretrained tokenizer (PyTorch)

Reposted. Author: 行者123. Updated: 2023-12-04 12:05:43

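I am new to PyTorch and have recently been trying out Transformers. I am using the pretrained tokenizers provided by HuggingFace.
I can download and run them successfully, but if I save them and try to load them again, an error occurs. If I download a tokenizer with AutoTokenizer.from_pretrained, it works.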

[1]:    from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()

Out[1]: dict_keys(['input_ids', 'attention_mask'])

But if I save it with tokenizer.save_pretrained("distilroberta-tokenizer") and then try to load it from the local directory, it fails.
[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')


---------------------------------------------------------------------------
OSError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
238 resume_download=resume_download,
--> 239 local_files_only=local_files_only,
240 )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
266 # File, but it doesn't exist.
--> 267 raise EnvironmentError("file {} not found".format(url_or_filename))
268 else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
193 config = kwargs.pop("config", None)
194 if not isinstance(config, PretrainedConfig):
--> 195 config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
196
197 if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
194
195 """
--> 196 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
197
198 if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
250 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
251 )
--> 252 raise EnvironmentError(msg)
253
254 except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file


It says that "config.json" is missing from the directory. Listing the directory shows these files:
[3]:    !ls distilroberta-tokenizer

Out[3]: merges.txt special_tokens_map.json tokenizer_config.json vocab.json
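
The listing above can also be checked programmatically. The following is a minimal sketch, not part of the original post: the helper name `missing_files` and its `required` default are hypothetical, and it only assumes that `config.json` is the file the traceback complains about.

```python
import os


def missing_files(directory, required=("config.json",)):
    """Return the names in `required` that are not present in `directory`.

    `required` defaults to config.json only, because that is the file the
    OSError above reports as missing. This is an illustrative helper, not
    something provided by the transformers library.
    """
    present = set(os.listdir(directory))
    return [name for name in required if name not in present]


# Example call (assumes the directory from the question exists):
# missing_files("distilroberta-tokenizer")
```

For a directory holding only the four tokenizer files shown above, this helper would report `config.json` as missing, which matches the error message.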

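I know this question has been asked before, but none of the suggested solutions seem to work. I also tried following the docs but still could not get it working.
Any help would be appreciated.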

Best answer

There is currently an issue under investigation that only affects AutoTokenizer, not the underlying tokenizers such as RobertaTokenizer. For example, the following should work:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')

To use AutoTokenizer, you also need to save the config so that it can be loaded offline:

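from transformers import AutoTokenizer, AutoConfig

# Download the tokenizer and the matching model config
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

# Save both to the same directory so that it also contains config.json
tokenizer.save_pretrained('YOURPATH')
config.save_pretrained('YOURPATH')

# AutoTokenizer can now find config.json in the local directory
tokenizer = AutoTokenizer.from_pretrained('YOURPATH')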
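
I recommend either using different paths for the tokenizer and the model, or keeping the model's config.json. Some modifications you apply to your model are stored in the config.json that model.save_pretrained() creates, and that file will be overwritten if you save the tokenizer as described above after saving the model (i.e. you will not be able to load your modified model with the tokenizer's config.json).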
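
One way to follow the "different paths" recommendation is sketched below. This is an illustration under stated assumptions, not the answer author's code: the function name `save_separately` and the `tokenizer`/`model` subdirectory names are hypothetical. It relies only on the objects exposing `save_pretrained(path)` and on the model having a `.config` attribute, as Hugging Face tokenizers, configs, and models do.

```python
import os


def save_separately(tokenizer, model, base_dir):
    """Save a tokenizer and a model under separate subdirectories.

    Hypothetical helper: the subdirectory names are an arbitrary choice.
    `tokenizer` and `model` need a `save_pretrained(path)` method, and
    `model.config` must provide one as well.
    """
    tok_dir = os.path.join(base_dir, "tokenizer")
    model_dir = os.path.join(base_dir, "model")
    os.makedirs(tok_dir, exist_ok=True)
    os.makedirs(model_dir, exist_ok=True)

    # Tokenizer files plus a config.json, so AutoTokenizer can load from tok_dir
    tokenizer.save_pretrained(tok_dir)
    model.config.save_pretrained(tok_dir)

    # The model writes its own config.json into model_dir; nothing saved
    # into tok_dir afterwards can overwrite it
    model.save_pretrained(model_dir)
    return tok_dir, model_dir
```

With this layout, the tokenizer would be loaded from the first returned path and the model from the second, so saving one never touches the other's config.json.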

Source: a similar question on Stack Overflow, "python - AutoTokenizer.from_pretrained fails to load a locally saved pretrained tokenizer (PyTorch)": https://stackoverflow.com/questions/62472238/
