
python-3.x - How to save a classifier in sklearn that uses CountVectorizer() and TfidfTransformer()

Reposted. Author: 行者123. Updated: 2023-12-04 00:02:29

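Below is some code for a classifier. I use pickle to save and load the classifier, as described on this page. However, when I load it to use it, I cannot use CountVectorizer() and TfidfTransformer() to convert raw text into vectors that the classifier can work with.
The only way I could get it to work was to analyze the text immediately after training the classifier, as shown below.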

import os
import pickle

import nltk
import pandas
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


class Classifier:

    def __init__(self):
        self.moviedir = os.getcwd() + '/txt_sentoken'

    def Training(self):
        # Load all files.
        self.movie = load_files(self.moviedir, shuffle=True)

        # Split data into training and test sets.
        docs_train, docs_test, y_train, y_test = train_test_split(
            self.movie.data, self.movie.target,
            test_size=0.20, random_state=12)

        # Initialize CountVectorizer.
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)

        # Fit and transform using the training text.
        docs_train_counts = self.movieVzer.fit_transform(docs_train)

        # Convert raw frequency counts into TF-IDF values.
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)

        # Using the fitted vectorizer and transformer, transform the test data.
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)

        # Now ready to build a classifier.
        # We will use Multinomial Naive Bayes as our model.

        # Train a Multinomial Naive Bayes classifier. Again, we call it "fitting".
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)

        # Save the model (only the classifier is written here).
        filename = 'finalized_model.pkl'
        with open(filename, 'wb') as fout:
            pickle.dump(self.clf, fout)

        # Predict the test set results and find the accuracy.
        y_pred = self.clf.predict(docs_test_tfidf)

        # Accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))

        self.Categorize()

    def Categorize(self):
        # Very short and fake movie reviews.
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good',
                       'This was certainly a movie', 'I fell asleep halfway through',
                       "We can't wait for the sequel!!", 'I cannot recommend this highly enough', 'What the hell is this shit?']

        reviews_new_counts = self.movieVzer.transform(reviews_new)         # turn text into count vector
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)  # turn into tfidf vector

        # Have the classifier make a prediction.
        pred = self.clf.predict(reviews_new_tfidf)

        # Print out the results.
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))
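
To illustrate why loading only the classifier is not enough, here is a minimal sketch. The toy corpus and all variable names are hypothetical and stand in for the movie-review data above. It shows that a freshly created CountVectorizer cannot transform text at all, and that re-fitting one on new text learns a different vocabulary, so its columns would no longer match the features the classifier was trained on.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the training reviews.
train_docs = ["an excellent movie", "a boring movie", "excellent acting"]

# The vectorizer fitted during training defines the feature columns.
fitted = CountVectorizer().fit(train_docs)
n_train_features = len(fitted.vocabulary_)

# A brand-new vectorizer has no vocabulary, so transform() refuses to run.
fresh = CountVectorizer()
try:
    fresh.transform(["excellent movie"])
    fresh_failed = False
except NotFittedError:
    fresh_failed = True

# Re-fitting on the new text runs, but it learns a different vocabulary,
# so the resulting matrix has a different number of columns.
refit = CountVectorizer().fit(["excellent movie"])
n_refit_features = len(refit.vocabulary_)

print(fresh_failed, n_train_features, n_refit_features)
```

The fitted vectorizer and transformer therefore carry state (vocabulary and IDF weights) that has to be persisted alongside the classifier.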

Best answer

Following MaximeKan's suggestion, I worked out a way to save all three objects.

Saving the model and the vectorizers

import pickle

with open(filename, 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)

Loading the model and the vectorizers for use

import pickle

with open('finalized_model.pkl', 'rb') as f:
    movieVzer, movieTfmer, clf = pickle.load(f)
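
As a rough end-to-end sketch of how the loaded objects might be used on new text, the example below fits the three objects on a tiny in-memory corpus, pickles them as a tuple, loads them back, and predicts. The corpus, labels, temporary directory, and variable names are assumptions made for illustration, and the default tokenizer is used in place of nltk.word_tokenize to keep the example self-contained. The second half shows an alternative that is not part of the answer: wrapping the same steps in sklearn.pipeline.Pipeline so that a single object is saved.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the movie reviews (1 = pos, 0 = neg).
train_docs = [
    "an excellent and joyful movie",
    "excellent acting, a joy to watch",
    "a boring and terrible movie",
    "terrible plot, boring acting",
]
train_labels = [1, 1, 0, 0]
new_reviews = ["excellent movie", "boring and terrible"]

workdir = tempfile.mkdtemp()  # scratch directory for this sketch only

# Tuple approach from the answer: fit, then pickle all three objects.
vzer = CountVectorizer()
tfmer = TfidfTransformer()
clf = MultinomialNB()
clf.fit(tfmer.fit_transform(vzer.fit_transform(train_docs)), train_labels)

tuple_path = os.path.join(workdir, "finalized_model.pkl")
with open(tuple_path, "wb") as fout:
    pickle.dump((vzer, tfmer, clf), fout)

with open(tuple_path, "rb") as fin:
    loaded_vzer, loaded_tfmer, loaded_clf = pickle.load(fin)

# On new text call transform() only, never fit_transform().
new_counts = loaded_vzer.transform(new_reviews)
new_tfidf = loaded_tfmer.transform(new_counts)
tuple_pred = loaded_clf.predict(new_tfidf)

# Alternative (not from the answer): one Pipeline object holds every step.
pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
pipe.fit(train_docs, train_labels)

pipe_path = os.path.join(workdir, "finalized_pipeline.pkl")
with open(pipe_path, "wb") as fout:
    pickle.dump(pipe, fout)

with open(pipe_path, "rb") as fin:
    loaded_pipe = pickle.load(fin)

# The pipeline accepts raw text directly.
pipe_pred = loaded_pipe.predict(new_reviews)

print(tuple_pred, pipe_pred)
```

Either way, the pickle should be loaded with the same scikit-learn version that wrote it, and any custom tokenizer (such as nltk.word_tokenize in the question) must be importable in the loading process.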

This post, "python-3.x - How to save a classifier in sklearn that uses CountVectorizer() and TfidfTransformer()", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/58020251/
