
python - Inconsistent numbers of samples in CountVectorizer

Reposted. Author: 行者123. Updated: 2023-11-30 09:12:35

I am trying to use multinomial Naive Bayes classification on a set of tweets I have.

Here is my code:

import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
trainfile = 'train.txt'
testfile = 'test.txt'
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8')) ## Error here
tags = ['Pro_vax','Anti_vax','Neither']
mnb = MultinomialNB()
mnb.fit(trainset, tags)
codecs.open(testfile,'r','utf8')
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)
print results

The file train.txt contains the following text:

Vaccines are a very good idea.  They prevent all sorts of deadly diseases.
Vaccines cause autism. Do not vaccinate your children
Going to read about vaccines. Then, I am going to see my brother with autism.

I have labeled them using the tags variable.

The file test.txt contains the following text:

Do not get your kids vaccinated.  Vaccination and autism are correlated.

When I run the script, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [3 9]

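I am not familiar with this error. What does it mean, and how can I prevent it from coming up again?

Best answer

This would be easier to diagnose with the full traceback, but it looks like tags contains 9 entries while the training set contains only three data points. What do your tags look like?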
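
One way to catch this kind of mismatch before fitting is to compare the number of documents with the number of labels explicitly. The sketch below is only an illustration: the helper name pair_documents_with_labels is hypothetical, the sample lines mirror train.txt from the question, and the trailing blank line is an assumed input quirk, not something the question confirms. It relies on the fact that a vectorizer treats each item of the iterable it receives as one sample, so iterating over an open file yields one sample per line, blank lines included.

```python
def pair_documents_with_labels(lines, labels):
    """Return (documents, labels) after checking they line up one-to-one.

    Blank lines are dropped first, because each item of the iterable
    handed to a vectorizer counts as one sample.
    """
    documents = [line.strip() for line in lines if line.strip()]
    if len(documents) != len(labels):
        raise ValueError(
            "%d documents but %d labels" % (len(documents), len(labels))
        )
    return documents, list(labels)


# Hypothetical data mirroring train.txt, plus an assumed trailing blank line.
raw_lines = [
    "Vaccines are a very good idea.  They prevent all sorts of deadly diseases.\n",
    "Vaccines cause autism. Do not vaccinate your children\n",
    "Going to read about vaccines. Then, I am going to see my brother with autism.\n",
    "\n",
]
tags = ["Pro_vax", "Anti_vax", "Neither"]

documents, labels = pair_documents_with_labels(raw_lines, tags)
print(len(documents), len(labels))
```

If the check passes, documents could then be passed to word_vectorizer.fit_transform and labels to mnb.fit in place of the open file handle and the raw tags list. If it raises, the message shows which side has the extra entries.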

Regarding python - Inconsistent numbers of samples in CountVectorizer, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29809731/
