python - Understanding text feature extraction with TfidfVectorizer in python scikit-learn


Reposted by: 太空狗. Updated: 2023-10-29 19:36:22

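Reading through the scikit-learn documentation on text feature extraction, I am not sure how the different arguments available for TfidfVectorizer (and maybe other vectorizers) affect the outcome.

Here are the arguments I am not sure about: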

TfidfVectorizer(stop_words='english',  ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

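The documentation clearly explains the use of stop_words/max_df (both have a similar effect, and one can perhaps be used instead of the other). However, I am not sure whether these options should be used together with ngrams. Which one is processed first, ngrams or stop_words? And why? Based on my experiment, stop words are removed first, but the purpose of ngrams is to extract phrases and the like. I am not sure about the effect of this sequence (stop words removed, then ngrams generated).

Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Aren't their purposes similar?

Best answer

I see several questions in this post.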

How do the different arguments in TfidfVectorizer interact with one another?

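You really have to use it a lot to develop a feel for it (that has been my experience, anyway).

TfidfVectorizer is a bag-of-words approach. In NLP, the sequence of words and the window around them matter; a bag of words destroys some of that context.

How do I control which tokens are output?

Set ngram_range to (1, 1) to output only one-word tokens, (1, 2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, and so on.
ngram_range works hand in hand with analyzer. Set analyzer to "word" to output words and phrases, or to "char" to output character ngrams.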
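To make the effect of these two arguments concrete, here is a minimal sketch that fits CountVectorizer on a single sentence and lists the resulting vocabulary. The helper name `vocabulary` and the sample sentence are only for illustration; the tokens follow from the default tokenizer, so the exact vocabulary may differ if other defaults are changed.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["what does tfidf stand for"]


def vocabulary(**kwargs):
    """Fit a CountVectorizer on `doc` and return its sorted feature names."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(doc)
    return sorted(vectorizer.vocabulary_)


# One-word tokens only.
unigrams = vocabulary(ngram_range=(1, 1))
# One-word and two-word tokens.
uni_and_bigrams = vocabulary(ngram_range=(1, 2))
# Character trigrams instead of word tokens.
char_trigrams = vocabulary(analyzer="char", ngram_range=(3, 3))

print(unigrams)
print(uni_and_bigrams)
print(char_trigrams[:5])
```

With analyzer="char" the ngram_range counts characters rather than words, which is why the two arguments have to be chosen together.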

If you want your output to contain both "word" and "char" features, use sklearn's FeatureUnion. The original answer links to an example.
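Since the linked example is not reproduced here, the following is only a rough sketch of how word and character features might be combined with FeatureUnion. The step names "word" and "char", the ngram ranges and the two sample documents are assumptions made for illustration, not the linked example itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["what is tfidf", "what does tfidf stand for"]

# Two vectorizers run side by side; their outputs are stacked column-wise.
combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])
features = combined.fit_transform(docs)

# After fitting, each step exposes its own vocabulary.
word_vocab = combined.transformer_list[0][1].vocabulary_
char_vocab = combined.transformer_list[1][1].vocabulary_

print(features.shape, len(word_vocab), len(char_vocab))
```

The resulting matrix has one row per document and as many columns as the two vocabularies together.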

How do I remove unwanted stuff?

Use stop_words to remove less meaningful English words.

The stop word list that sklearn uses can be found at:

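# The old sklearn.feature_extraction.stop_words module has been removed;
# the list is importable from sklearn.feature_extraction.text.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS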

The logic of removing stop words is that these words carry little meaning and appear very often in most text:

[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]
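A frequency list like the one above can be produced with a simple counter. The sketch below uses a made-up sentence rather than the corpus behind those numbers (which is not given here), and checks how much of the text is covered by sklearn's stop word list.

```python
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Toy text for illustration only; not the corpus behind the list above.
text = "the cat sat on the mat and the dog sat by the door"

counts = Counter(text.split())
top = counts.most_common(3)

# Share of all word occurrences that are stop words.
stop_occurrences = sum(n for word, n in counts.items() if word in ENGLISH_STOP_WORDS)
stop_share = stop_occurrences / sum(counts.values())

print(top)
print(round(stop_share, 2))
```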

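Since stop words generally have a high frequency, it can make sense to set max_df to a float such as 0.95, which drops terms that appear in more than 95% of the documents. This assumes that those very frequent terms are all stop words, which may not be the case. It really depends on your text data. In my line of work it is very common for the most frequent words or phrases not to be stop words, because I work with dense text (search query data) on very specific topics.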
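As a small illustration of how max_df behaves, the sketch below compares the vocabulary with and without the filter on four toy documents. The documents and the threshold 0.75 are chosen only to make the effect visible; max_df is compared against the share of documents containing a term, and only terms strictly above the threshold are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "what is tfidf",
    "what does tfidf stand for",
    "tfidf is what",
    "why use tfidf",
]

all_terms = set(CountVectorizer().fit(docs).vocabulary_)
# Drop terms that occur in more than 75% of the documents.
kept_terms = set(CountVectorizer(max_df=0.75).fit(docs).vocabulary_)

dropped_terms = sorted(all_terms - kept_terms)
print(dropped_terms)  # ['tfidf']
```

Note that "tfidf" is dropped here even though it is not a stop word, which is exactly the risk described above. "what" occurs in 3 of 4 documents, which is not strictly above 0.75, so it is kept.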

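Use min_df as an integer to remove rare words. If they only occur in one or two documents, they won't add much value and are usually quite obscure. There are also generally a lot of them, so ignoring them with, say, min_df=5 can greatly reduce your memory consumption and data size.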
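The size reduction can be seen even on a tiny corpus. This is a minimal sketch with made-up documents and min_df=2 (instead of 5) so that the toy data still leaves something behind.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "what is tfidf",
    "what does tfidf stand for",
    "tfidf is what",
    "why use tfidf",
]

full_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
# Keep only terms that occur in at least 2 documents.
frequent_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(len(full_vocab), "->", len(frequent_vocab))
print(frequent_vocab)
```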

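How do I include stuff that is being stripped out?

token_pattern uses the regex \b\w\w+\b, which means tokens must be at least 2 characters long. Words like "I" and "a" are therefore removed, and so are single-digit numbers such as 0 - 9. You will also notice that it strips apostrophes.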
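The behaviour of this pattern can be checked with the standard re module alone. The sketch below applies the regex to a lowercase sample sentence (the vectorizer lowercases text by default before tokenizing); the sentence itself is only an illustration.

```python
import re

# The pattern quoted above; (?u) makes \w match Unicode word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

text = "why don't i use tfidf 1 in 10"
tokens = default_pattern.findall(text)

print(tokens)  # ['why', 'don', 'use', 'tfidf', 'in', '10']
```

The single characters "i" and "1" disappear, and "don't" is cut at the apostrophe, leaving "don" while the lone "t" is too short to be kept.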

What happens first, ngram generation or stop word removal?

Let's do a little test.

import numpy as np
import pandas as pd

# ENGLISH_STOP_WORDS now lives in sklearn.feature_extraction.text
from sklearn.feature_extraction.text import (
    TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS)

docs = np.array(['what is tfidf',
        'what does tfidf stand for',
        'what is tfidf and what does it stand for',
        'tfidf is what',
        "why don't I use tfidf",
        '1 in 10 people use tfidf'])

# use_idf=False and norm=None make the output plain term counts
tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()

# get_feature_names() was replaced by get_feature_names_out() in newer releases
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

# Manually strip stop words from each document for comparison
for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))

This prints out:

tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf

Now let's print df:

                                           10  and  does  don  for   in   is  \
what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0   
what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0   
tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0   
1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0   

                                           it  people  stand  tfidf  use  \
what is tfidf                             0.0     0.0    0.0    1.0  0.0   
what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0   
what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0   
tfidf is what                             0.0     0.0    0.0    1.0  0.0   
why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0   
1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0   

                                          what  why  
what is tfidf                              1.0  0.0  
what does tfidf stand for                  1.0  0.0  
what is tfidf and what does it stand for   2.0  0.0  
tfidf is what                              1.0  0.0  
why don't I use tfidf                      0.0  1.0  
1 in 10 people use tfidf                   0.0  0.0

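Notes:

use_idf=False, norm=None: with these set, it is equivalent to using sklearn's CountVectorizer. It just returns counts.

Notice that the word "don't" was converted to "don". This is where you would change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.

We see a lot of stop words.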
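The first two notes can be verified directly. The sketch below is an illustration under the default settings: it checks that TfidfVectorizer with use_idf=False and norm=None yields the same matrix as CountVectorizer, and that the suggested token_pattern keeps the apostrophe. The sample documents are taken from the test above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["what is tfidf", "what does tfidf stand for", "why don't I use tfidf"]

# Note 1: without idf weighting and normalisation, only raw counts remain.
counts = CountVectorizer().fit_transform(docs).toarray()
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()
same_matrix = np.array_equal(counts, raw_tf)

# Note 2: a token pattern that allows apostrophes inside a token.
tokenize = TfidfVectorizer(token_pattern=r"\b\w[\w']+\b").build_tokenizer()
tokens = tokenize("why don't i use tfidf")

print(same_matrix)
print(tokens)  # ['why', "don't", 'use', 'tfidf']
```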

Let's remove the stop words and look at df again:

tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Output:

                                           10  10 people  does  does stand  \
what is tfidf                             0.0        0.0   0.0         0.0   
what does tfidf stand for                 0.0        0.0   1.0         0.0   
what is tfidf and what does it stand for  0.0        0.0   1.0         1.0   
tfidf is what                             0.0        0.0   0.0         0.0   
why don't I use tfidf                     0.0        0.0   0.0         0.0   
1 in 10 people use tfidf                  1.0        1.0   0.0         0.0   

                                          does tfidf  don  don use  people  \
what is tfidf                                    0.0  0.0      0.0     0.0   
what does tfidf stand for                        1.0  0.0      0.0     0.0   
what is tfidf and what does it stand for         0.0  0.0      0.0     0.0   
tfidf is what                                    0.0  0.0      0.0     0.0   
why don't I use tfidf                            0.0  1.0      1.0     0.0   
1 in 10 people use tfidf                         0.0  0.0      0.0     1.0   

                                          people use  stand  tfidf  \
what is tfidf                                    0.0    0.0    1.0   
what does tfidf stand for                        0.0    1.0    1.0   
what is tfidf and what does it stand for         0.0    1.0    1.0   
tfidf is what                                    0.0    0.0    1.0   
why don't I use tfidf                            0.0    0.0    1.0   
1 in 10 people use tfidf                         1.0    0.0    1.0   

                                          tfidf does  tfidf stand  use  \
what is tfidf                                    0.0          0.0  0.0   
what does tfidf stand for                        0.0          1.0  0.0   
what is tfidf and what does it stand for         1.0          0.0  0.0   
tfidf is what                                    0.0          0.0  0.0   
why don't I use tfidf                            0.0          0.0  1.0   
1 in 10 people use tfidf                         0.0          0.0  1.0   

                                          use tfidf  
what is tfidf                                   0.0  
what does tfidf stand for                       0.0  
what is tfidf and what does it stand for        0.0  
tfidf is what                                   0.0  
why don't I use tfidf                           1.0  
1 in 10 people use tfidf                        1.0

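Takeaways:

The token "don use" appears because in "don't I use" the "'t" was stripped off and "I", being shorter than two characters, was removed. The remaining words were then joined into "don use", which is not a phrase that actually occurs in the text and can subtly change the structure!

Answer: stop words and short tokens are removed first, and only then are ngrams generated, which can return unexpected results.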
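The order of operations can also be observed without building a full matrix, by asking the vectorizer for its analyzer function and applying it to single documents. This is a minimal sketch using two of the test sentences; newer scikit-learn releases may emit a warning about the stop word list being inconsistent with the tokenizer, which does not affect the result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function that performs lowercasing,
# tokenization, stop word removal and ngram generation for one document.
analyze = TfidfVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()

first = analyze("why don't I use tfidf")
second = analyze("what is tfidf and what does it stand for")

print(first)
print(second)
```

The bigrams "don use" and "tfidf does" are built from words that were not adjacent in the original sentences, which shows that ngrams are formed only after the filtering steps.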

Does it make sense to use max_df/min_df arguments together with use_idf argument?

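My opinion is that the whole point of term frequency-inverse document frequency is to allow re-weighting of highly frequent words (the words that would appear at the top of a sorted frequency list). This re-weighting takes the highest-frequency ngrams and moves them down the list to a lower position. It is therefore supposed to handle the max_df scenarios.

Whether you want to move them down the list ("re-weight" or de-prioritize them) or remove them completely is more of a personal choice.

I use min_df a lot. It makes sense to use min_df if you are working with a huge dataset, because rare words won't add value and will just cause a lot of processing issues. I don't use max_df much, but I am sure there are cases, such as working with all of Wikipedia, where removing the top x% makes sense.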
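The difference between re-weighting and removal can be sketched on a toy corpus. The documents and the max_df value below are assumptions chosen for illustration: with use_idf=True the most widespread term stays in the vocabulary but receives the lowest idf weight, while max_df removes it entirely.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "what is tfidf",
    "what does tfidf stand for",
    "tfidf is what",
    "why use tfidf",
]

# Re-weighting: every term is kept, frequent terms get a smaller idf.
weighted = TfidfVectorizer(use_idf=True).fit(docs)
idf_by_term = {
    term: float(weighted.idf_[column])
    for term, column in weighted.vocabulary_.items()
}
lowest_idf_term = min(idf_by_term, key=idf_by_term.get)

# Removal: terms in more than 75% of the documents are dropped.
filtered = TfidfVectorizer(use_idf=True, max_df=0.75).fit(docs)
removed_terms = sorted(set(weighted.vocabulary_) - set(filtered.vocabulary_))

print(lowest_idf_term, round(idf_by_term[lowest_idf_term], 3))
print(removed_terms)
```

Both settings can be combined: max_df and min_df decide which terms exist at all, and idf then weights whatever is left.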

Regarding python - understanding text feature extraction with TfidfVectorizer in python scikit-learn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47557417/
