
tensorflow - How to get the Bert tokenizer's vocab file from TF Hub

Reposted. Author: 行者123. Last updated: 2023-12-05 07:15:12

I am trying to use Bert from TensorFlow Hub and build a tokenizer. This is what I am doing:

>>> import tensorflow_hub as hub
>>> from bert.tokenization import FullTokenizer

>>> BERT_URL = 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/1'
>>> bert_layer = hub.KerasLayer(BERT_URL, trainable=False)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

But now, when I check the vocab file in the resolved object, I get an empty tensor:

>>> bert_layer.resolved_object.vocab_file.asset_path.shape
TensorShape([])

What is the correct way to get this vocab file?

Best answer

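A shape of TensorShape([]) means a scalar string tensor, not an empty one. Calling .numpy() on asset_path returns the path to the vocab file. Try this: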

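import bert  # assumed here to be the bert-for-tf2 package, which exposes bert_tokenization
import tensorflow_hub as hub

FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)

# asset_path is a scalar string tensor; .numpy() returns the vocab file path as bytes
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
tokenizer = FullTokenizer(vocab_file)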
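
As a rough illustration of what a tokenizer does with that path: a BERT vocab file is plain text with one token per line, and a token's id is its line index. The sketch below is not the FullTokenizer implementation. `load_vocab_sketch` is a hypothetical helper, and a small temporary file with made-up contents stands in for the real asset, assuming the path arrives as bytes the way `.numpy()` returns it.

```python
import os
import tempfile
from collections import OrderedDict


def load_vocab_sketch(vocab_path):
    """Map each token to its line index (hypothetical helper, not the bert API)."""
    # .numpy() on a string tensor yields bytes, so decode to str if needed.
    if isinstance(vocab_path, bytes):
        vocab_path = vocab_path.decode("utf-8")
    vocab = OrderedDict()
    with open(vocab_path, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab


# Stand-in for the hub asset: a tiny made-up vocab file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(["[PAD]", "[UNK]", "where", "are", "you"]) + "\n")

demo_path = tmp.name.encode("utf-8")  # bytes, like asset_path.numpy()
demo_vocab = load_vocab_sketch(demo_path)
os.remove(tmp.name)
print(demo_vocab["where"])
```

With the real layer, the same helper could be pointed at `vocab_file` from the answer to inspect the vocabulary without constructing a tokenizer.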

You can then use the tokenizer to tokenize text.

tokenizer.tokenize('Where are you going?') 

['w', '##hee', '##re', 'are', 'you', 'going', '?']
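
The `##` prefix marks a piece that continues the preceding piece within the same word. Here is a minimal sketch of greedy longest-match-first WordPiece splitting, using a made-up toy vocabulary rather than the real BERT vocab file. `wordpiece_sketch` and `toy_vocab` are illustrative names, not part of the bert package.

```python
def wordpiece_sketch(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"go", "going", "##ing", "un", "##go", "you"}
print(wordpiece_sketch("going", toy_vocab))    # ['going']
print(wordpiece_sketch("ungoing", toy_vocab))  # ['un', '##go', '##ing']
print(wordpiece_sketch("Where", toy_vocab))    # ['[UNK]']
```

The last call shows why casing matters: this toy vocab holds only lower-case entries, so a capitalized word finds no matching piece.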

You can also pass other arguments to the tokenizer. For example:

do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)
tokenizer.tokenize('Where are you going?')

['where', 'are', 'you', 'going', '?']
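
With do_lower_case set to true, the text is lower-cased and accents are stripped before the WordPiece lookup, which is what an uncased vocab expects. Below is a rough sketch of that normalization step in plain Python. `normalize_uncased` is a hypothetical name, and the real tokenizer also handles whitespace, punctuation, and CJK characters, which this omits.

```python
import unicodedata


def normalize_uncased(text):
    """Lower-case text and drop combining accent marks."""
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks left after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Where are you going?"))  # where are you going?
print(normalize_uncased("Café"))                  # cafe
```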

Regarding "tensorflow - How to get the Bert tokenizer's vocab file from TF Hub", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59654175/
