python - Comparing tf.keras.preprocessing.text.Tokenizer() and tfds.features.text.Tokenizer()


Reposted by: 行者123. Updated: 2023-12-05 00:46:15

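For some background, I have recently been getting more into NLP and text processing; I am more familiar with computer vision. I fully understand the idea of tokenization.

My confusion stems from the various implementations of the Tokenizer class found across the TensorFlow ecosystem.

There is a Tokenizer class in TensorFlow Datasets (tfds) as well as one in TensorFlow proper: tfds.features.text.Tokenizer() and tf.keras.preprocessing.text.Tokenizer(), respectively.

I looked at the source code (linked below) but could not glean any useful insight.

The tl;dr question here is: which library do you use for what? What are the benefits of one library over the other?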

Note

I was following the TensorFlow in Practice Specialization as well as this tutorial. The TF in Practice Specialization uses the tf.keras.preprocessing.text.Tokenizer() implementation, and the text loading tutorial uses tfds.features.text.Tokenizer().

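Best answer

Many packages have started to provide their own APIs for text preprocessing, but each one has its own nuances.

tf.keras.preprocessing.text.Tokenizer() is implemented by Keras and is supported by TensorFlow as a high-level API.

tfds.features.text.Tokenizer() is developed and maintained by TensorFlow itself.

Each has its own way of encoding tokens, as the example below illustrates.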

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences  # needed for the padding step
import tensorflow_datasets as tfds

Let's take some sample data and look at the encoded output of the two APIs:

text_data = ["4. Kurt Betschart - Bruno Risi ( Switzerland ) 22",
            "Israel approves Arafat 's flight to West Bank .",
            "Moreau takes bronze medal as faster losing semifinalist .",
            "W D L G / F G / A P",
            "-- Helsinki newsroom +358 - 0 - 680 50 248",
            "M'bishi Gas sets terms on 7-year straight ."]

First, let's look at the result of tf.keras.preprocessing.text.Tokenizer():

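tf_keras_tokenizer = Tokenizer()
# Build the word index from the sample sentences.
tf_keras_tokenizer.fit_on_texts(text_data)
# Convert each sentence into a list of integer ids.
tf_keras_encoded = tf_keras_tokenizer.texts_to_sequences(text_data)
# Pad shorter sequences with trailing zeros so that all rows have equal length.
tf_keras_encoded = pad_sequences(tf_keras_encoded, padding="post")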

For the first sentence in our input data, the result is:

tf_keras_encoded[0]

array([2, 3, 4, 5, 6, 7, 8, 0], dtype=int32)
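The trailing 0 in the array above comes from post-padding: each sequence is right-padded with zeros up to the length of the longest one (8 tokens here). The following is a minimal pure-Python sketch of that idea, not the Keras implementation; `pad_post` is a hypothetical helper, and the two input rows are the encodings of the first two sample sentences.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


# Encodings of the first two sample sentences (7 and 8 tokens respectively).
encoded = [[2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]]
padded = pad_post(encoded)
print(padded[0])
```

The shorter first row gains one zero at the end, while the second row is already at full length and stays unchanged.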

Here is the index-to-word mapping:

tf_keras_tokenizer.index_word  


{1: 'g',
 2: '4',
 3: 'kurt',
 4: 'betschart',
 5: 'bruno',
 6: 'risi',
 7: 'switzerland',
 8: '22',
 9: 'israel',
 10: 'approves',
 11: 'arafat',
 12: "'s",
 13: 'flight',
 14: 'to',
 15: 'west',
 16: 'bank',
 17: 'moreau',
 18: 'takes',
 19: 'bronze',
 20: 'medal',
 21: 'as',
 22: 'faster',
 23: 'losing',
 24: 'semifinalist',
 25: 'w',
 26: 'd',
 27: 'l',
 28: 'f',
 29: 'a',
 30: 'p',
 31: 'helsinki',
 32: 'newsroom',
 33: '358',
 34: '0',
 35: '680',
 36: '50',
 37: '248',
 38: "m'bishi",
 39: 'gas',
 40: 'sets',
 41: 'terms',
 42: 'on',
 43: '7',
 44: 'year',
 45: 'straight'}
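The mapping above can be reproduced without TensorFlow, which makes the rules behind it easier to see: the text is lowercased, the default filter characters are replaced by spaces, and words are ranked by frequency (ties keep their order of first appearance), with indices starting at 1. The sketch below is a simplified re-implementation under those assumptions, not the Keras source; `keras_like_words` and `build_word_index` are hypothetical names.

```python
from collections import Counter

text_data = ["4. Kurt Betschart - Bruno Risi ( Switzerland ) 22",
             "Israel approves Arafat 's flight to West Bank .",
             "Moreau takes bronze medal as faster losing semifinalist .",
             "W D L G / F G / A P",
             "-- Helsinki newsroom +358 - 0 - 680 50 248",
             "M'bishi Gas sets terms on 7-year straight ."]

# Default filter characters of the Keras Tokenizer; the apostrophe is not among them.
KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def keras_like_words(text):
    """Lowercase, turn filter characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in KERAS_FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by frequency; ties keep first-appearance order; ids start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(keras_like_words(text))
    # sorted() is stable, so equally frequent words stay in insertion order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}


word_index = build_word_index(text_data)
first_sentence_ids = [word_index[w] for w in keras_like_words(text_data[0])]
print(first_sentence_ids)
```

This explains why 'g' gets index 1: it is the only word that occurs twice in the sample data, so it ranks ahead of everything else.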

Now let's try tfds.features.text.Tokenizer():

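# Create the tokenizer (newer TFDS releases expose this API under tfds.deprecated.text).
tfds_tokenizer = tfds.features.text.Tokenizer()

# Collect the vocabulary from every sentence.
text_vocabulary_set = set()
for text in text_data:
    text_tokens = tfds_tokenizer.tokenize(text)
    text_vocabulary_set.update(text_tokens)

# Build an encoder that maps each token in the vocabulary to an integer id.
tfds_text_encoder = tfds.features.text.TokenTextEncoder(text_vocabulary_set, tokenizer=tfds_tokenizer)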

For the first sentence in our input data, the result is:

tfds_text_encoder.encode(text_data[0])

[35, 19, 44, 38, 32, 2, 14]

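Here is the token-to-index mapping. Note that the indices in this internal mapping start at 0, while encode() returns each id shifted up by 1 (for example, '4' is stored as 34 but encoded as 35). Because the vocabulary is built from a set, the exact ids may differ between runs.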

tfds_text_encoder._token_to_id  

{'0': 0,
 '22': 13,
 '248': 17,
 '358': 23,
 '4': 34,
 '50': 9,
 '680': 6,
 '7': 26,
 'A': 19,
 'Arafat': 39,
 'Bank': 35,
 'Betschart': 43,
 'Bruno': 37,
 'D': 15,
 'F': 20,
 'G': 28,
 'Gas': 29,
 'Helsinki': 38,
 'Israel': 3,
 'Kurt': 18,
 'L': 44,
 'M': 5,
 'Moreau': 22,
 'P': 10,
 'Risi': 31,
 'Switzerland': 1,
 'W': 30,
 'West': 33,
 'approves': 4,
 'as': 7,
 'bishi': 2,
 'bronze': 12,
 'faster': 8,
 'flight': 27,
 'losing': 42,
 'medal': 32,
 'newsroom': 11,
 'on': 25,
 's': 24,
 'semifinalist': 40,
 'sets': 36,
 'straight': 45,
 'takes': 41,
 'terms': 16,
 'to': 14,
 'year': 21}
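The behaviour visible in this mapping can be approximated in plain Python: tokens are runs of alphanumeric characters, case is preserved, and encoded ids are the vocabulary positions shifted by 1 so that 0 stays free for padding. The sketch below is an approximation under those assumptions, not the TFDS source; `tfds_like_tokens`, `encode` and `decode` are hypothetical names, and the vocabulary is sorted (unlike the set used in the original code) so that the ids are reproducible.

```python
import re

text_data = ["4. Kurt Betschart - Bruno Risi ( Switzerland ) 22",
             "Israel approves Arafat 's flight to West Bank .",
             "Moreau takes bronze medal as faster losing semifinalist .",
             "W D L G / F G / A P",
             "-- Helsinki newsroom +358 - 0 - 680 50 248",
             "M'bishi Gas sets terms on 7-year straight ."]


def tfds_like_tokens(text):
    """Keep runs of word characters, preserve case, drop punctuation."""
    return re.findall(r"\w+", text)


# Sorted for reproducibility; iterating over a plain set gives no stable order.
vocab = sorted({tok for text in text_data for tok in tfds_like_tokens(text)})
token_to_id = {tok: idx for idx, tok in enumerate(vocab)}  # starts at 0


def encode(text):
    """Shift every id by 1 so that 0 remains available as the padding id."""
    return [token_to_id[tok] + 1 for tok in tfds_like_tokens(text)]


def decode(ids):
    """Undo the shift and join the tokens with spaces."""
    return " ".join(vocab[i - 1] for i in ids)


first_ids = encode(text_data[0])
print(first_ids)
print(decode(first_ids))
```

Unlike the Keras sketch, "M'bishi" is split into "M" and "bishi" here and nothing is lowercased, which is why this vocabulary has 46 entries instead of 45.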

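As the two results show, the encodings differ: the Keras tokenizer lowercases the text and keeps apostrophes inside tokens ("m'bishi", "'s"), whereas the tfds tokenizer preserves case and splits on every non-alphanumeric character ("M", "bishi", "s"). Both APIs also expose parameters that you can use and change as needed.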

This comparison of tf.keras.preprocessing.text.Tokenizer() and tfds.features.text.Tokenizer() is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/61661160/
