
python - scikit-learn CountVectorizer UnicodeDecodeError


I have the following code snippet, where I'm trying to list the term frequencies; first_text and second_text are .tex documents:

from sklearn.feature_extraction.text import CountVectorizer

# first_text and second_text hold the contents of the two .tex documents
training_documents = (first_text, second_text)

vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)

# the fitted term-to-index mapping is stored in vocabulary_ (trailing underscore)
print "Vocabulary:", vectorizer.vocabulary_

When I run the script, I get the following:

File "test.py", line 19, in <module>
vectorizer.fit_transform(training_documents)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
for feature in analyze(doc):
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte

How can I fix this problem?

Thanks.

Best Answer

If you can figure out what encoding the documents use (perhaps they are latin-1), you can pass it to CountVectorizer:

vectorizer = CountVectorizer(encoding='latin-1')
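
If you are not sure what the encoding is, one way to make an educated guess, not part of the original answer and assuming the third-party chardet package is installed and the file name below is just a placeholder, is to run the raw bytes through a detector first:

import chardet  # third-party package: pip install chardet
from sklearn.feature_extraction.text import CountVectorizer

# Read one of the documents as raw bytes (placeholder file name).
with open('first_text.tex', 'rb') as f:
    raw = f.read()

# chardet returns a dict such as {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
guess = chardet.detect(raw)
print "Detected encoding:", guess['encoding']

vectorizer = CountVectorizer(encoding=guess['encoding'])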

Otherwise, you can skip the tokens that contain the problematic bytes:

vectorizer = CountVectorizer(decode_error='ignore')
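
Either option slots into the snippet from the question. Here is a minimal sketch, assuming the two .tex files are read as raw byte strings (the file names are placeholders) so that CountVectorizer performs the decoding itself:

from sklearn.feature_extraction.text import CountVectorizer

# Placeholder names for the two .tex documents from the question.
paths = ['first_text.tex', 'second_text.tex']

# Read raw bytes; CountVectorizer decodes them using its encoding/decode_error settings.
training_documents = [open(p, 'rb').read() for p in paths]

# Either specify the real encoding, or drop undecodable input instead.
vectorizer = CountVectorizer(encoding='latin-1')          # if the encoding is known
# vectorizer = CountVectorizer(decode_error='ignore')     # if it is not

counts = vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary_
print "Term-document matrix shape:", counts.shape

Note that decode_error='ignore' tells the underlying bytes.decode call to silently drop anything it cannot decode, so passing the real encoding is preferable when you can determine it.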

Regarding python - scikit-learn CountVectorizer UnicodeDecodeError, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40077084/
