python - sklearn model.predict gives a wrong-shape error after splitting with kf.split

Reposted. Author: 行者123. Last updated: 2023-12-04 03:29:01

I am trying to use sklearn to run predictions with a model trained on text strings. The code is as follows:

import numpy as np
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

news = datasets.load_files("dataset-news", encoding='latin1', categories=categories)

def vectorize_data(data):
    count_vect = CountVectorizer()
    return count_vect.fit_transform(data)

# Gaussian naive Bayes
def gaussian_train(train, target):
    gnb = GaussianNB()
    gnb.fit(train, target)
    return gnb

kf = KFold(n_splits=5)
counter = 1

for train_idx, test_idx in kf.split(news.data):
    print("%d Fold" % counter)
    train_data = vectorize_data(np.array(news.data)[train_idx])
    test_data = vectorize_data(np.array(news.data)[test_idx])

    print("Gaussian naive Bayes")
    print(train_data.shape)
    print(test_data.shape)
    g_model_train = gaussian_train(train_data.toarray(), news.target[train_idx])
    # predict_data(g_model_fold, test_data.toarray(), target_data)
    # Predict unseen test data based on fitted classifier
    predicted = g_model_fold.predict(test_data.toarray())

Output from my console:

1 Fold
Gaussian naive Bayes
(640, 13477)
(161, 5193)

But then I get:

ValueError: operands could not be broadcast together with shapes (161,5193) (14214,)

How can I fix this?

Best answer

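When you convert text into token counts, the training and test sets must use the same features so that both matrices have the same number of columns. In the question, a new CountVectorizer is fitted on each split, so each split gets its own vocabulary and therefore a different column count. One option is to return the CountVectorizer fitted on the training data and reuse it to transform the test data. So we change the vectorize_data() function to return the fitted vectorizer: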

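import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

def vectorize_data(data):
    # Fit only; return the vectorizer so the same vocabulary can be reused
    count_vect = CountVectorizer()
    return count_vect.fit(data)

# Same helper as in the question
def gaussian_train(train, target):
    gnb = GaussianNB()
    gnb.fit(train, target)
    return gnb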
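To make the column mismatch concrete, here is a minimal sketch on a toy corpus. The sentences and variable names are invented for illustration and are not from the original question; only CountVectorizer and its fit, transform, and fit_transform methods are real sklearn API.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for illustration only.
train_docs = ["the rocket reached orbit", "the probe left orbit"]
test_docs = ["a rocket and a telescope"]

# Fitting separately: each vectorizer learns its own vocabulary,
# so the two matrices end up with different numbers of columns.
separate_train = CountVectorizer().fit_transform(train_docs)
separate_test = CountVectorizer().fit_transform(test_docs)
print(separate_train.shape, separate_test.shape)

# Fitting once on the training text and reusing that vectorizer:
# both matrices share the training vocabulary and column count.
vectorizer = CountVectorizer().fit(train_docs)
shared_train = vectorizer.transform(train_docs)
shared_test = vectorizer.transform(test_docs)
print(shared_train.shape, shared_test.shape)
```

Words that appear only in the test text (here "telescope") are simply ignored by transform, which is what keeps the test matrix aligned with the features the model was trained on.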

Using a sample dataset:

categories = ['alt.atheism', 'sci.space']

news = datasets.fetch_20newsgroups(categories=categories)

Run the k-fold loop:

kf = KFold(n_splits=5)

for train_idx, test_idx in kf.split(news.data):

    # Fit the vectorizer on the training split only, then reuse it for both splits
    cvect = vectorize_data(np.array(news.data)[train_idx])
    train_data = cvect.transform(np.array(news.data)[train_idx])
    test_data = cvect.transform(np.array(news.data)[test_idx])

    print("Gaussian naive Bayes")
    print(train_data.shape)
    print(test_data.shape)
    g_model_train = gaussian_train(train_data.toarray(), news.target[train_idx])
    # predict_data(g_model_fold, test_data.toarray(), target_data)
    # Predict unseen test data based on fitted classifier
    predicted = g_model_train.predict(test_data.toarray())

Output:

Gaussian naive Bayes
(858, 20415)
(215, 20415)
Gaussian naive Bayes
(858, 20019)
(215, 20019)
Gaussian naive Bayes
(858, 20094)
(215, 20094)
Gaussian naive Bayes
(859, 20119)
(214, 20119)
Gaussian naive Bayes
(859, 20207)
(214, 20207)
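As an alternative to threading the fitted vectorizer through the loop by hand, the same fit-on-train, transform-on-test pattern can be sketched with an sklearn Pipeline and cross_val_score. This is not the method used in the answer above, just one possible arrangement. The toy documents, labels, and the to_dense helper are assumptions made up for illustration; the densifying step is there because GaussianNB needs a dense array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(matrix):
    # Hypothetical helper: convert the sparse count matrix to a dense array.
    return matrix.toarray()


# Toy data, invented for illustration; 0 = space, 1 = cooking.
docs = [
    "rocket launch orbit", "satellite orbit moon", "moon landing rocket",
    "astronaut space station", "telescope star galaxy",
    "bake bread oven", "fry onion garlic", "boil pasta sauce",
    "roast chicken oven", "chop garlic sauce",
]
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# In each fold the pipeline fits the vectorizer on the training split only
# and reuses it to transform the test split.
model = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    GaussianNB(),
)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, docs, labels, cv=kf)
print(scores)
```

Because the vectorizer lives inside the pipeline, it is refitted for every fold, so the test split never contributes to the vocabulary and the column counts always match.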

This post is based on a similar question found on Stack Overflow about python - sklearn model.predict giving a wrong-shape error after splitting with kf.split: https://stackoverflow.com/questions/67207580/
