python - Scikit-Learn/Pandas : make a prediction using a saved model based on user input

Reposted by 行者123. Last updated: 2023-11-30 09:32:49

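I'm building a machine learning model using Pandas, but I'm having a hard time applying the model to data entered by a user. My data is basically a dataframe with two columns: text and sentiment. I want to be able to predict the sentiment of whatever the user inputs. Here is what I do: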

1. Train/test the model

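import joblib
import pandas as pd
from sklearn import metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# reading dataset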
df = pd.read_csv('dataset/dataset.tsv', sep='\t')
# splitting training/test set
test_size = 0.1
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['sentiment'], test_size=test_size)

# label encode the target variable (i.e. negative = 0, positive = 1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping learned from train_y

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

# function to train the model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, name):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # save the trained model in the "models" folder
    joblib.dump(classifier, 'models/' + name + '.pkl') 

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    return metrics.accuracy_score(predictions, valid_y)

# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count, 'NB-COUNT')
print("NB, Count Vectors: ", accuracy)

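Everything works fine, and the accuracy is around 80%.

2. Test the model on user input

Then I load the saved model again, take the user input, and try to make a prediction (the user input is hard-coded in input_text for now):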

clf = joblib.load('models/NB-COUNT.pkl')
dataset_df = pd.read_csv('dataset/dataset.tsv', sep='\t')
input_text = 'stackoverflow is the best'  # the sentence I want to predict the sentiment for
test_df = pd.Series(data=input_text)

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(dataset_df['text'])  # fit the count vectorizer again so we can extract features from test_df
features = count_vect.transform(test_df)
result = clf.predict(features)[0]
print(result)

But I get a "dimension mismatch" error:

Traceback (most recent call last):
  File "C:\Users\vdvax\iCloudDrive\Freelance\09. Arabic Sentiment Analysis\test.py", line 20, in <module>
    result = clf.predict(features)[0]
  File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Python36\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
    ret = a * b
  File "C:\Python36\lib\site-packages\scipy\sparse\base.py", line 515, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch

Best answer

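You're getting the dimension mismatch error because the output of the CountVectorizer transform does not have the shape that the fitted estimator expects. This happens because you fit a separate CountVectorizer for your test data instead of reusing the one the model was trained with.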
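To make the shape issue concrete, here is a minimal sketch on made-up toy data (the corpora and variable names are assumptions for illustration, not the asker's dataset). It compares the number of columns produced by the vectorizer used for training with the number produced by a second, independently fitted vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only
train_texts = ["good movie", "bad movie", "great plot", "awful plot"]
train_labels = [1, 0, 1, 0]
other_texts = ["stackoverflow is the best"]

# Vectorizer fitted together with the classifier
train_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
clf = MultinomialNB().fit(train_vect.fit_transform(train_texts), train_labels)

# A second vectorizer fitted on different text learns a different vocabulary
other_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
other_vect.fit(other_texts)

# Number of feature columns the classifier was trained on
n_expected = clf.feature_log_prob_.shape[1]
# Columns from the independently fitted vectorizer (differs from n_expected)
n_mismatch = other_vect.transform(other_texts).shape[1]
# Columns from the training vectorizer (matches n_expected)
n_reused = train_vect.transform(other_texts).shape[1]

print(n_expected, n_mismatch, n_reused)

# Predicting works only with features from the training vectorizer
prediction = clf.predict(train_vect.transform(other_texts))
```

Whenever the vocabulary learned at prediction time differs from the one learned at training time, the column counts differ and the classifier cannot multiply the matrices, which is what the traceback reports.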

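Scikit-learn provides a convenient interface called Pipeline that lets you stack your preprocessors and your estimator into a single estimator class. Put all of your transformers into the Pipeline ahead of the estimator, and your test data will then be transformed by the already-fitted transformers. Here is how you would fit a pipelined version of your estimator: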

from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# takes a list of tuples where the first arg is the step name,
# and the second is the estimator itself.
pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB())
])

# you can fit a pipeline in the same way you would any other estimator,
# and it will go sequentially through every stage
pipe.fit(train_x, train_y)

# you can produce predictions by feeding raw (unvectorized) text into the pipe,
# e.g. the validation split from the question
pipe.predict(valid_x)

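Note that this way you also don't have to create lots of copies of your data at the various preprocessing stages, because the output of one stage is fed directly into the next.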
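As a rough check of that chaining behavior, the sketch below (toy data and variable names are assumptions) pulls the fitted steps out of a pipeline through `named_steps` and confirms that calling them by hand gives the same predictions as calling the pipeline directly:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
texts = ["good movie", "bad movie", "great plot", "awful plot"]
labels = [1, 0, 1, 0]

pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', MultinomialNB()),
])
pipe.fit(texts, labels)

new_texts = ["great movie", "awful movie"]

# The fitted steps are accessible by the names given in the Pipeline
cvec = pipe.named_steps['cvec']
clf = pipe.named_steps['clf']

# Manual route: vectorize with the fitted vectorizer, then classify
manual = clf.predict(cvec.transform(new_texts))
# Pipeline route: raw text in, predictions out
via_pipe = pipe.predict(new_texts)

same = np.array_equal(manual, via_pipe)
print(same)
```

Because the vectorizer inside the pipeline is the one that was fitted during training, its vocabulary size always agrees with the number of features the classifier expects.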

Now, on to your persistence question. A pipeline can be persisted in the same way as any other model:

import joblib

joblib.dump(pipe, 'models/NB-COUNT.pkl')
loaded_model = joblib.load('models/NB-COUNT.pkl')
loaded_model.predict(test_df)
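Putting the pieces together, a possible end-to-end sketch looks like the following. The training sentences, the temporary file location, and the idea of saving the LabelEncoder alongside the pipeline in one dictionary are assumptions added for illustration. They are not part of the original answer, but storing the encoder makes it possible to turn the numeric prediction back into the original sentiment label.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for dataset.tsv
texts = ["this is the best", "really great site",
         "this is the worst", "really awful site"]
sentiments = ["positive", "positive", "negative", "negative"]

# Encode string labels as integers (classes are sorted alphabetically)
encoder = LabelEncoder()
y = encoder.fit_transform(sentiments)

pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', MultinomialNB()),
])
pipe.fit(texts, y)

# Persist the pipeline and the encoder together (temporary path for the demo)
model_path = os.path.join(tempfile.mkdtemp(), 'NB-COUNT.pkl')
joblib.dump({'pipeline': pipe, 'encoder': encoder}, model_path)

# Later, possibly in another script: load and predict on raw user input
bundle = joblib.load(model_path)
input_text = 'stackoverflow is the best'
predicted = bundle['pipeline'].predict([input_text])
label = bundle['encoder'].inverse_transform(predicted)[0]
print(label)
```

Note that the loaded pipeline takes the raw string (wrapped in a list or Series) directly, so there is no need to re-read the dataset or re-fit a CountVectorizer at prediction time.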

This post is based on a similar question on Stack Overflow, "python - Scikit-Learn/Pandas : make a prediction using a saved model based on user input": https://stackoverflow.com/questions/51444758/