python - tf-idf vectorizer with char_wb has whitespace in feature words?

Reposted by: 行者123 Updated: 2023-12-01 08:25:09


I am using

singleTFIDF = TfidfVectorizer(
    analyzer='char_wb', 
    ngram_range=(4,6),
    stop_words=my_stop_words, 
    max_features=50
).fit([text])

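and I am wondering why there is whitespace in my features, for example "chaft ".

How can I avoid this? Do I need to tokenize and preprocess the text myself?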

Best answer

Use analyzer='word'.

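With analyzer='char_wb', the vectorizer pads each word with a space on both sides, because it does not tokenize the text into words; it builds its features from characters.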

According to the documentation for the analyzer parameter:

analyzer{‘word’, ‘char’, ‘char_wb’} or callable, default=’word’

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
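The padding described in that passage can be imitated in a few lines of plain Python. This is a simplified sketch, not scikit-learn's actual implementation: the function name `char_wb_ngrams` is made up here, and details such as how words shorter than the n-gram length are counted are ignored.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Simplified sketch of word-boundary character n-grams (not scikit-learn's code)."""
    ngrams = []
    for word in text.lower().split():
        # Each word is padded with one space on both sides.
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # The window slides over the padded word only, so no n-gram
            # ever reaches into the neighbouring word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


grams = char_wb_ngrams("This is the first document.", 4, 6)
print(sorted(set(grams)))
```

Because every word is padded before the window slides over it, n-grams that touch the start or end of a word carry a space, while n-grams from the middle of a long word do not.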

See the example below:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Character n-grams of length 4 to 6, built inside word boundaries only
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print([(len(w), str(w)) for w in vectorizer.get_feature_names_out()])

[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),(4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),(5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),(4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),(5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),(4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),(6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),(4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5,'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4,'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5,'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4,'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4,'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5,'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6,'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6,'ument '), (6, 'ument.'), (6, 'ument?')]

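Note:

The output/features include ' this' (padded at the start with an extra space that does not exist in the original text; the sentence begins with 'This')
The output/features include 'ment. ' (padded at the end with an extra space that does not exist in the original text; the sentence ends with 'document.')
The output/features do not include 'is the', because that n-gram spans a word boundary, and the 'char_wb' analyzer only creates n-grams "inside word boundaries"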

Regarding python - tf-idf vectorizer with char_wb has whitespace in feature words?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54308898/
