
python - Error with scikit-learn TfidfVectorizer when using the stop_words parameter


I'm new to scikit-learn. I'm trying to fit a TF-IDF vectorizer on a 1*M numpy.array of English sentences, i.e. tot_data in the code below. Here 'word' is a numpy.array (1*173) holding a list of stop words, and I need to pass it explicitly through the stop_words parameter. The code runs fine if I don't set stop_words, but the line below raises an error.

word = numpy.array(['a','about',...])

>>> vectorizer = TfidfVectorizer(max_df=.95,stop_words=word).fit(tot_data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1203, in fit
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 710, in _count_vocab
analyze = self.build_analyzer()
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 225, in build_analyzer
stop_words = self.get_stop_words()
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 208, in get_stop_words
return _check_stop_list(self.stop_words)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 85, in _check_stop_list
if stop == "english":
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Best Answer

Cause: the error occurs because a NumPy array broadcasts the comparison to its elements:

>>> word == 'english'
array([False, False, False], dtype=bool)

The if statement cannot reduce the resulting array to a single boolean value:

>>> if word == 'english': pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
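
A plain Python list, by contrast, compares to a string as a single boolean, which is why converting the array avoids the error. A minimal check, reusing the same word array as above:

>>> list(word) == 'english'
False
>>> if list(word) == 'english': pass
...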

Solution: convert the word array into a plain Python list: word = list(word)

Demo:

>>> import numpy as np
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> word = np.array(['one','two','three'])
>>> tot_data = np.array(['one two three', 'who do I see', 'I see two girls'])
>>> v = TfidfVectorizer(max_df=.95,stop_words=list(word))
>>> v.fit(tot_data)
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
...
tokenizer=None, use_idf=True, vocabulary=None)
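
Once fitted, the vectorizer can be used like any other scikit-learn transformer. A minimal follow-up sketch (exact reprs depend on the Python and scikit-learn versions; with the three sentences above, 'one', 'two' and 'three' are filtered out as stop words):

>>> X = v.transform(tot_data)   # sparse TF-IDF matrix, one row per sentence
>>> X.shape
(3, 4)
>>> sorted(v.vocabulary_)       # the stop words are not in the vocabulary
[u'do', u'girls', u'see', u'who']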

On python - Error with scikit-learn TfidfVectorizer when using the stop_words parameter, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20400454/
