tensorflow - How to get the vocab file for the Bert tokenizer from TF Hub

Reposted by: 行者123 | Updated: 2023-12-05 07:15:12

I am trying to use Bert from TensorFlow Hub and build a tokenizer. This is what I am doing:

>>> import tensorflow_hub as hub
>>> from bert.tokenization import FullTokenizer

>>> BERT_URL = 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/1'
>>> bert_layer = hub.KerasLayer(BERT_URL, trainable=False)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

But now, when I inspect the vocab file in the resolved object, I get what looks like an empty tensor:

>>> bert_layer.resolved_object.vocab_file.asset_path.shape
TensorShape([])

What is the correct way to get this vocab file?

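Best answer

Try this. The asset_path tensor is a scalar string tensor, which is why its shape prints as TensorShape([]); it is not empty, and calling .numpy() on it returns the path it holds: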

import bert  # assumes the bert-for-tf2 package, which exposes bert.bert_tokenization
import tensorflow_hub as hub

FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)

# Path of the vocab file bundled with the Hub module (returned as bytes)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
tokenizer = FullTokenizer(vocab_file)
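
For reference, .numpy() on a scalar string tensor returns a Python bytes object rather than a str, and a BERT vocab file is plain text with one token per line, where the line index serves as the token id. The sketch below shows one way to inspect such a file using only the standard library. Note that load_vocab here is a hypothetical helper written for illustration (not the function shipped with the bert package), and the file contents in the usage are made up.

```python
import os
import tempfile
from collections import OrderedDict


def load_vocab(vocab_path):
    """Map each token to its line index. Hypothetical helper for illustration."""
    # A tf.string scalar's .numpy() yields bytes, so accept bytes or str.
    if isinstance(vocab_path, bytes):
        vocab_path = vocab_path.decode("utf-8")
    vocab = OrderedDict()
    with open(vocab_path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            vocab[line.rstrip("\n")] = index
    return vocab


# Usage with a tiny made-up vocab file standing in for the real Hub asset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nwhere\n##re\n")
    # Pass the path as bytes, the way it would arrive from .numpy().
    demo_vocab = load_vocab(demo_path.encode("utf-8"))

print(len(demo_vocab), demo_vocab["where"])
```

With the real module, the same helper could be pointed at the vocab_file value obtained above to check, for example, whether a given word is in the vocabulary.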

You can then use the tokenizer to tokenize text:

tokenizer.tokenize('Where are you going?')

['w', '##hee', '##re', 'are', 'you', 'going', '?']
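
The pieces that start with ## in the output above are WordPiece continuation pieces: when a word is not in the vocab, the tokenizer splits it into the longest pieces it can find, and every piece after the first is marked with ##. A minimal sketch of that greedy longest-match-first idea follows. It assumes a tiny made-up vocab, and wordpiece is a hypothetical function written for illustration; the real FullTokenizer also cleans the text and splits punctuation before this step.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocab: "unaffable" is absent, so it is split into smaller pieces.
toy_vocab = {"un", "##aff", "##able", "going"}
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("going", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

This is why the contents of the vocab file matter: the same word can come out as one token or several depending on which pieces the file contains.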

You can also pass other settings from the module into the tokenizer. For example:

do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case) 
tokenizer.tokenize('Where are you going?')

['where', 'are', 'you', 'going', '?']
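
For context, in the reference BERT tokenization code, do_lower_case=True means the text is lowercased and accent marks are stripped before the WordPiece split, which is why an uncased vocab expects lowercase input. The following is a rough standard-library sketch of that normalization step only; normalize_uncased is a hypothetical name used for illustration and is not part of the bert package.

```python
import unicodedata


def normalize_uncased(text):
    """Lowercase and strip accents, roughly what do_lower_case=True implies."""
    text = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category Mn), i.e. the accents.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Where are you going?"))
print(normalize_uncased("Café"))
```

Reading do_lower_case from the module, as in the answer, keeps the tokenizer's casing behaviour consistent with the vocab the model was trained on.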

This post, tensorflow - How to get the vocab file for the Bert tokenizer from TF Hub, is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/59654175/
