python - Reducing the dimensionality of term vectors from TfidfVectorizer/CountVectorizer

Reposted by: 行者123 · Updated: 2023-12-04 11:36:20

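I want to use TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get vector representations of my terms. That is, I want a vector for each term in which the documents are the features. This is simply the transpose of the TF-IDF matrix that TfidfVectorizer creates.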

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
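
To make the transposition concrete, here is a minimal sketch on a three-document corpus. The corpus contents are invented for illustration and are not from the question. Each row of the transposed matrix is one term vector, and each of its columns is a document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(corpus)   # shape: (n_documents, n_terms)
term_doc = doc_term.transpose().tocsr()       # shape: (n_terms, n_documents)

# vocabulary_ maps each term to its column in doc_term,
# which is its row in term_doc after the transpose.
cat_index = vectorizer.vocabulary_["cat"]
cat_vector = term_doc[cat_index].toarray().ravel()

print(doc_term.shape, term_doc.shape)
print(cat_vector)  # one weight per document; zero where "cat" is absent
```

With 800k documents, every such term vector has 800k entries, which is the size problem described next.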

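However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The max_features parameter of CountVectorizer would do exactly what I am looking for: I can specify a dimension, and CountVectorizer tries to fit all the information into that dimension. Unfortunately, this option applies to the document vectors rather than to the terms in the vocabulary. It therefore reduces the size of my vocabulary, because the terms are the features.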
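
The effect of max_features can be checked on a small example. The sketch below uses an invented four-document corpus: the number of rows (documents) stays the same, and only the columns (terms) are cut down to the most frequent ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "apple banana apple",
    "banana cherry",
    "apple cherry date",
    "apple banana",
]

full = CountVectorizer().fit_transform(corpus)

capped_vectorizer = CountVectorizer(max_features=2)
capped = capped_vectorizer.fit_transform(corpus)

# Rows (documents) are untouched; only columns (terms) are dropped.
print(full.shape)    # documents x all terms
print(capped.shape)  # documents x the 2 most frequent terms
print(sorted(capped_vectorizer.vocabulary_))
```

Note that max_features selects terms by frequency and discards the rest. It does not compress the discarded information into the remaining columns.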

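Is there a way to do the opposite? For example, to perform a transpose on the TfidfVectorizer object before it starts cutting and normalizing everything? If such an approach exists, how would I do it? Something like this: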

>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)

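I have been searching for such an approach, but every guide and code example talks about document TF-IDF vectors rather than term vectors.
Thank you very much!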

Best answer

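I am not aware of a direct way to do this, but let me suggest one way it could be achieved.
You are trying to represent each term in the corpus as a vector that uses the documents in the corpus as its component features. Because the number of documents (which are the features in your case) is very large, you want to limit them in a way similar to max_features.
According to the CountVectorizer user guide (the same applies to TfidfVectorizer):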

max_features int, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

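In a similar way, you want to keep the top documents ordered by "frequency across terms", which may sound confusing. It can be rephrased simply as "keep the documents that contain the most distinct terms".
One way I can think of is to use inverse_transform and perform the following steps: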

    from sklearn.feature_extraction.text import TfidfVectorizer

    # `corpus` is the iterable of documents from the question.
    vectorizer = TfidfVectorizer()
    model = vectorizer.fit_transform(corpus)

    # inverse_transform returns, for each document, the array of
    # terms that have nonzero entries in that document's row.
    inverse_model = vectorizer.inverse_transform(model)

    # To rank the documents, reduce each array of terms to its length,
    # i.e. the number of distinct terms the document contains.
    inverse_model_count = [len(doc_terms) for doc_terms in inverse_model]

    # Pair each count with its document id (its index in the corpus)
    # so the id survives sorting.
    inverse_model_count_tuples = list(enumerate(inverse_model_count))

    # Sort by the term count (the second tuple component), descending,
    # and keep the top max_features documents.
    max_features = 100
    top_documents_tuples = sorted(inverse_model_count_tuples,
                                  key=lambda item: item[1],
                                  reverse=True)[:max_features]

    # Keep only the document ids (the first tuple component). Use a list:
    # indexing a sparse matrix with a tuple is read as (row, column).
    top_documents = [doc_id for doc_id, _ in top_documents_tuples]

    # Slice the original model down to the selected documents (rows).
    reduced_model = model[top_documents]
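
The same ranking can also be sketched without building per-document Python lists, which may matter at 800k documents. The version below is an alternative under stated assumptions, not the answer's own code: it assumes the matrix is in CSR format with no explicitly stored zeros, so the number of stored entries per row equals the number of distinct terms. The helper name `keep_richest_documents` and the toy corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def keep_richest_documents(doc_term, max_documents):
    """Keep the rows (documents) that contain the most distinct terms."""
    doc_term = doc_term.tocsr()
    # In CSR, consecutive indptr values delimit each row's stored entries.
    distinct_terms = np.diff(doc_term.indptr)
    # Stable sort on negated counts: descending, ties keep corpus order.
    order = np.argsort(-distinct_terms, kind="stable")[:max_documents]
    return doc_term[order], order


# Toy corpus, invented purely for illustration.
corpus = [
    "red",                    # 1 distinct term
    "red green blue yellow",  # 4 distinct terms
    "red green",              # 2 distinct terms
    "blue yellow pink",       # 3 distinct terms
]

model = TfidfVectorizer().fit_transform(corpus)
reduced_model, kept = keep_richest_documents(model, max_documents=2)

# Transposing the reduced matrix gives one row per term,
# with only the kept documents as features.
term_vectors = reduced_model.transpose().tocsr()
print(kept, term_vectors.shape)
```

Returning the kept row indices alongside the matrix makes it possible to map each remaining feature back to its original document.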

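Note that this method only considers the number of distinct terms in each document, regardless of their counts (CountVectorizer) or weights (TfidfVectorizer).
If the direction of this approach is acceptable to you, then with a little more code the counts or weights of the terms can be taken into account as well.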
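
One possible way to take counts or weights into account is sketched below. It assumes that a document's importance is the sum of the values in its row, which is only one of several reasonable choices (the maximum value or another norm would also work). The helper name `top_documents_by_weight` and the toy corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def top_documents_by_weight(doc_term, max_documents):
    """Rank documents by the sum of their term counts or weights."""
    doc_term = doc_term.tocsr()
    # Row sums come back as a 2-D matrix; flatten them to a 1-D array.
    row_totals = np.asarray(doc_term.sum(axis=1)).ravel()
    # Stable sort on negated totals: descending, ties keep corpus order.
    return np.argsort(-row_totals, kind="stable")[:max_documents]


# Toy corpus, invented purely for illustration.
corpus = [
    "spam spam spam spam",  # 1 distinct term, 4 tokens
    "alpha beta gamma",     # 3 distinct terms, 3 tokens
    "alpha beta",           # 2 distinct terms, 2 tokens
]

counts = CountVectorizer().fit_transform(corpus)
kept = top_documents_by_weight(counts, max_documents=2)
reduced = counts[kept]
print(kept, reduced.shape)
```

The first document ranks highest here even though it has only one distinct term, which shows how this criterion differs from counting distinct terms. With TfidfVectorizer's default L2 normalization the row sums vary far less, so the choice of scoring function deserves some thought.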
I hope this helps!

Regarding "python - Reducing the dimensionality of term vectors from TfidfVectorizer/CountVectorizer", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61274499/


