vectorization - UserWarning: Your stop_words may be inconsistent with your preprocessing

Reposted by: 行者123 · Updated: 2023-12-03 14:50:33

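I am following this document clustering tutorial. As input, I provide a txt file that can be downloaded here. It is a combination of 3 other txt files, separated by \n. After creating the tf-idf matrix, I get this warning: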

UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

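I guess it has something to do with the order of lemmatization and stop word removal, but since this is my first text-processing project, I am a bit lost and don't know how to fix it...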

import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens


totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i)  # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print(f'there are {vocab_frame.shape[0]} items in vocab_frame')
print(vocab_frame.head())

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)  # fit the vectorizer to synopses

print(tfidf_matrix.shape)

Best answer

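The warning is trying to tell you that if your text contains "always", it will be normalised to "alway" before being matched against your stop list, which includes "always" but not "alway". So it won't be removed from your bag of words.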
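
The mismatch can be reproduced in isolation. The sketch below is an illustration rather than the asker's exact pipeline: it assumes NLTK's SnowballStemmer and scikit-learn's built-in English list (ENGLISH_STOP_WORDS, which is what stop_words='english' selects), and it stems each stop word directly instead of going through nltk.word_tokenize.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = SnowballStemmer("english")

word = "always"
stem = stemmer.stem(word)

# The raw word is on the stop list, but the stem the tokenizer emits is not,
# so a stop-word filter applied after stemming never matches it.
print(word, word in ENGLISH_STOP_WORDS)
print(stem, stem in ENGLISH_STOP_WORDS)

# Stems of stop words that are themselves missing from the stop list.
# This approximates the list printed in the warning.
inconsistent = sorted(
    {stemmer.stem(w) for w in ENGLISH_STOP_WORDS} - set(ENGLISH_STOP_WORDS)
)
print(inconsistent[:10])
```

Every entry in inconsistent is a token that can survive stop-word removal even though its unstemmed form is a stop word.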

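The solution is to preprocess your stop list so that it is normalised in the same way your tokens will be, and to pass the list of normalised words as stop_words to the vectorizer.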
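
A minimal sketch of that fix follows, under some assumptions: the tokenizer is a simplified regex-based stand-in for the question's tokenize_and_stem (so it runs without NLTK's punkt data), the helper normalise_stop_words and the three sample documents are hypothetical, and the stop list is scikit-learn's ENGLISH_STOP_WORDS. Because a stemmer is not guaranteed to be idempotent, the helper re-applies the tokenizer until no new tokens appear, so the resulting list stays consistent when the vectorizer tokenizes it again for its check.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # Simplified stand-in for the question's tokenizer: keep alphabetic runs, then stem.
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]


def normalise_stop_words(words):
    # Hypothetical helper: run the stop list through the same tokenizer as the
    # documents, repeating until the set is closed under that tokenizer.
    normalised = set()
    frontier = set(words)
    while frontier:
        stems = {tok for w in frontier for tok in tokenize_and_stem(w)}
        frontier = stems - normalised
        normalised |= stems
    return sorted(normalised)


stemmed_stop_words = normalise_stop_words(ENGLISH_STOP_WORDS)

# Made-up sample documents, for illustration only.
docs = [
    "She always stems every word.",
    "Stemming always changes some words.",
    "Clusters group similar documents.",
]

vectorizer = TfidfVectorizer(
    stop_words=stemmed_stop_words,  # normalised list instead of 'english'
    tokenizer=tokenize_and_stem,
    token_pattern=None,  # unused when a custom tokenizer is supplied
)
tfidf_matrix = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
```

With the question's own tokenizer, the same idea applies: build the stop list by calling that tokenizer on each stop word, then pass the result as stop_words in place of 'english'.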

Regarding vectorization - UserWarning: Your stop_words may be inconsistent with your preprocessing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57340142/


