I have trained a model and saved it as a pickle file, but when I try to load it on new data I get an error:
>>> Traceback (most recent call last): File "", line 1, in 
Please refer to the script below, in which I trained on the data and saved the pickle files.
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data
import os
import pandas as pd
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.grid_search import GridSearchCV
import pickle
##Set working directory
os.getcwd()
os.chdir("C:/Prediction")
##Read history data file
train = pd.read_csv("C:/Prediction/Past.csv",encoding='cp1252')
##Text cleaning: keep only key words / stemming
stemmer = SnowballStemmer('english')
def Description_to_words(raw_Description):
    #1. Remove HTML.
    Description_text = BeautifulSoup(raw_Description).get_text()
    #2. Remove non-letters:
    #letters_only = re.sub("[^\w\s]", " ", Description_text)
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    #3. Convert to lower case
    words = word_tokenize(letters_only.lower())
    #4. Remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    #5. Stem the remaining words and join them back into one string
    return " ".join(stemmer.stem(w) for w in meaningful_words)
# Get the number of Descriptions based on the dataframe column size
num_Descriptions = train["Description"].size
# Initialize an empty list to hold the clean Descriptions
clean_train_Descriptions = []
# Loop over each Description; create an index i that goes from 0 to the length
# of the Ticket Description list
print("Cleaning and parsing the training set ticket Descriptions...\n")
clean_train_Descriptions = []
for i in range(0, num_Descriptions):
    # If the index is evenly divisible by 1000, print a message
    if (i+1) % 1000 == 0:
        print("Description %d of %d\n" % (i+1, num_Descriptions))
    # Call our function for each one, and add the result to the list of
    # clean Descriptions
    clean_train_Descriptions.append(Description_to_words(train["Description"][i]))
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000,
                             ngram_range=(1, 2))
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_Descriptions)
# Numpy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()
# Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, train["Group"])
###Save pickle files
pickle.dump(train_data_features, open("vector.pickel","wb"))
pickle.dump(forest, open("classifier-rf.pickel","wb"))
But when I load the vector.pickel file to create test_data_features for the new dataset, I get an error. Can anyone help me resolve this, or do I have to retrain the model every time I want to predict on a new dataset? Please advise.
# Read the test data
test = pd.read_csv("C:/New.csv",encoding='cp1252')
# Create an empty list and append the clean Descriptions one by one
num_Descriptions = len(test["Description"])
clean_test_Descriptions = []
print("Cleaning and parsing the test set movie Descriptions...\n")
for i in range(0,num_Descriptions):
if( (i+1) % 1000 == 0 ):
print("Description %d of %d\n" % (i+1, num_Descriptions))
clean_Description = Description_to_words( test["Description"][i] )
clean_test_Descriptions.append( clean_Description )
# Get a bag of words for the test set, and convert to a numpy array
vect1 = CountVectorizer(analyzer = "word", \
tokenizer = None, \
preprocessor = None, \
stop_words = None, \
max_features = 5000, \
ngram_range=(1,2))
vect1=pickle.load(open("vector.pickel","rb"))
test_data_features = vect1.transform(clean_test_Descriptions)
Best Answer
You pickled the wrong object. In the pickling section you are pickling the matrix that the CountVectorizer transform returns, not the transformer itself.
What you need to do is pickle the vectorizer:
# create CountVectorizer transformer
vectorizer = CountVectorizer(analyzer="word",
tokenizer=None,
preprocessor=None,
stop_words=None,
max_features=5000,
ngram_range=(1, 2))
# fit on training data
# assuming clean_train_Descriptions is training set
vectorizer.fit(clean_train_Descriptions)
# now pickle
pickle.dump(vectorizer, open("vector.pickel", "wb"))
Now, when you need to score, just load that object and score the new data with it:
# load pickle
vectorizer = pickle.load(open("vector.pickel", "rb"))
# score
# assuming clean_test_Descriptions is the test set
test_data_features = vectorizer.transform(clean_test_Descriptions)
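The same pattern applies to the Random Forest model that the training script already saves as classifier-rf.pickel, so nothing has to be retrained when a new dataset arrives. Below is a minimal sketch of the full scoring step, assuming the file names from the question and that clean_test_Descriptions has been built as shown above:
# Load the fitted vectorizer and the trained Random Forest saved during training
import pickle
vectorizer = pickle.load(open("vector.pickel", "rb"))
forest = pickle.load(open("classifier-rf.pickel", "rb"))
# Transform the cleaned test Descriptions with the already-fitted vectorizer,
# mirroring the training script by converting the sparse result to an array
test_data_features = vectorizer.transform(clean_test_Descriptions)
test_data_features = test_data_features.toarray()
# Predict the Group label for each test Description
predictions = forest.predict(test_data_features)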
On python - loading a pickle file for CountVectorizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45674411/