Multi-label classification

I'm trying to predict multi-label classes using scikit-learn/pandas/OneVsRestClassifier/logistic regression. Building and evaluating the model works, but attempting to classify new sample text does not.
Scenario 1:
After building the model, I save it as sample.pkl and restart my kernel. When I then load the saved model (sample.pkl) and try to predict on sample text, I get the error:
NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
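The error itself can be reproduced in isolation: calling transform on a freshly constructed, never-fitted TfidfVectorizer raises exactly this exception (a minimal sketch, independent of the model code below):

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()  # brand-new object: no vocabulary learned yet
err = None
try:
    vectorizer.transform(["some sample text"])
except NotFittedError as exc:
    err = exc

print(type(err).__name__, err)  # exact wording varies by scikit-learn version
```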
Inference code:
import os
import re
import csv
import json
import pickle
import collections
from collections import Counter
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import f1_score  # performance metric
from sklearn.multiclass import OneVsRestClassifier  # binary relevance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

stop_words = set(stopwords.words('english'))
def cleanHtml(sentence):
    ''' remove HTML tags '''
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext

def cleanPunc(sentence):
    ''' clean the word of any punctuation or special characters '''
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned

def keepAlpha(sentence):
    """ keep only alphabetic characters """
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def remove_stopwords(text):
    """ remove stop words """
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)
test1 = pd.read_csv("C:\\Users\\abc\\Downloads\\test1.csv")
test1.columns
test1.head()
siNo  plot                                movie_name         genre_new
1     The story begins with Hannah...     sing               [drama, teen]
2     Debbie's favorite band is Dream...  the biggest fan    [drama]
3     This story of a Zulu family is...   come back, africa  [drama, Documentary]
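For context, a minimal sketch of the training side that the question implies but does not show (the tiny plot/genre data here is a stand-in for the real CSV, not the question's actual data): the binarizer and vectorizer are fitted once on the training set, and the classifier is trained on their outputs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Tiny stand-in training data (illustration only, not the question's CSV)
plots = [
    "a zulu family returns home to africa",
    "a singer meets her biggest fan",
    "teen drama unfolds at a high school",
]
genres = [["documentary"], ["drama"], ["drama", "teen"]]

# Fit the binarizer and vectorizer ONCE, on the training data
multilabel_binarizer = MultiLabelBinarizer()
y = multilabel_binarizer.fit_transform(genres)

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(plots)

# One logistic-regression model per label (binary relevance)
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)
```

These three fitted objects (binarizer, vectorizer, classifier) are exactly what must survive a kernel restart for inference to work.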
The error occurs when I run inference on sample text:
def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)
    q = remove_stopwords(q)
    multilabel_binarizer = MultiLabelBinarizer()
    tfidf_vectorizer = TfidfVectorizer()
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

for i in range(5):
    print(i)
    k = test1.sample(1).index[0]
    print("Movie: ", test1['movie_name'][k], "\nPredicted genre: ", infer_tags(test1['plot'][k]))
    print("Actual genre: ", test1['genre_new'][k], "\n")
Solved
I fixed it by also saving the TfidfVectorizer and MultiLabelBinarizer to pickle files:
import joblib  # sklearn.externals.joblib is removed in recent scikit-learn

pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(multilabel_binarizer, open("multibinirizer_vectorizer.pickle", "wb"))
vectorizer = joblib.load('/abc/downloads/tfidf_vectorizer.pickle')
multilabel_binarizer = joblib.load('/abc/downloads/multibinirizer_vectorizer.pickle')

def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)
    q = remove_stopwords(q)
    q_vec = vectorizer.transform([q])
    q_pred = rf_model.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)
I got the solution from this link: How do I store a TfidfVectorizer for future use in scikit-learn?
Best answer
This happens because you dumped only the classifier to pickle, not the vectorizer.
During inference, when you call
tfidf_vectorizer = TfidfVectorizer()
you get a fresh vectorizer that was never fitted on the training vocabulary, which produces the error.
What you should do is dump both the classifier and the vectorizer to pickle, and load both of them back during inference.
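Concretely, that can look like the following sketch. Bundling everything into one dict under one file name ("sample.pkl", as in the question) is just one convenient convention, and the toy training data is a stand-in:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# --- training side: fit all three objects (toy data for illustration) ---
plots = [
    "a zulu family returns home to africa",
    "a singer meets her biggest fan",
    "teen drama unfolds at a high school",
]
genres = [["documentary"], ["drama"], ["drama", "teen"]]

multilabel_binarizer = MultiLabelBinarizer()
y = multilabel_binarizer.fit_transform(genres)
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(plots)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Dump ALL fitted objects, not just the classifier
with open("sample.pkl", "wb") as f:
    pickle.dump({"vectorizer": tfidf_vectorizer,
                 "binarizer": multilabel_binarizer,
                 "classifier": clf}, f)

# --- inference side (e.g. after a kernel restart): load everything back ---
with open("sample.pkl", "rb") as f:
    bundle = pickle.load(f)

q_vec = bundle["vectorizer"].transform(["a family documentary"])
q_pred = bundle["classifier"].predict(q_vec)
labels = bundle["binarizer"].inverse_transform(q_pred)
```

The loaded vectorizer carries the training vocabulary with it, so transform works without refitting and no NotFittedError is raised.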
A similar question about this python-3.x issue (loading a pickle raises NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted) can be found on Stack Overflow: https://stackoverflow.com/questions/57213165/