python - How do I use this machine learning model on a new file that has no output column?

Reposted. Author: 行者123. Updated: 2023-11-30 09:41:46

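I trained on data from a csv file with 2 columns: the first is the review text and the second is the result. The model produces output, but I want to test it on a file that has no output column. How do I do that?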

import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# review.csv contains two columns
# first column is the review content (quoted)
# second column is the assigned sentiment (positive or negative)
def load_file():
    with open('review.csv', newline='') as csv_file:
        reader = csv.reader(csv_file, delimiter=",", quotechar='"')
        next(reader)  # skip the first (header) row
        data = []
        target = []
        for row in reader:
            # skip short rows and missing data
            if len(row) >= 2 and row[0] and row[1]:
                data.append(row[0])
                target.append(row[1])

    return data, target

# preprocess creates the term frequency matrix for the review data set
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary=True)
    data = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)

    return tfidf_data

def learn_model(data, target):
    # preparing data for split validation. 60% training, 40% test
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.4, random_state=43)
    classifier = BernoulliNB().fit(data_train, target_train)
    predicted = classifier.predict(data_test)
    evaluate_model(target_test, predicted)

# read more about model evaluation metrics here
# http://scikit-learn.org/stable/modules/model_evaluation.html
def evaluate_model(target_true, target_predicted):
    print(classification_report(target_true, target_predicted))
    print("The accuracy score is {:.2%}".format(accuracy_score(target_true, target_predicted)))

def main():
    data, target = load_file()
    tf_idf = preprocess()
    learn_model(tf_idf, target)


main()

My accuracy is 65%. How do I now test this model on a new file that has no output column, and write the output to a new file?

Best answer

A simple approach is to use scikit-learn's Pipeline.
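
To see what a pipeline takes care of, it may help to look at the manual equivalent first: the vectorizer is fitted once on the training text, and that same fitted object is then reused with transform (not fit_transform) on the new, unlabeled text. The sketch below assumes small made-up in-memory lists in place of review.csv and the new file, and leaves out the TF step for brevity; the names train_texts, train_labels and new_texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up toy data standing in for review.csv (assumption, not from the question)
train_texts = [
    "great product, loved it",
    "terrible, waste of money",
    "loved the quality",
    "terrible support, awful",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Made-up unlabeled data standing in for the new file without an output column
new_texts = ["loved it", "awful waste"]

# Fit the vocabulary on the training text only
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)
classifier = BernoulliNB().fit(X_train, train_labels)

# Reuse the fitted vectorizer: transform, do not fit again,
# so the new matrix has the same columns the classifier was trained on
X_new = vectorizer.transform(new_texts)
predictions = classifier.predict(X_new)
print(list(predictions))
```

A Pipeline bundles these fit and transform calls, so a single fit and a single predict keep the steps in sync.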

Suppose you read your training data with something like this:

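import pandas as pd

def read_training(filename):
    # Read from a csv file with two columns. Skip bad lines
    # header=0 replaces the file's first row (the one the question's loader skips)
    # with the column names given below
    df = pd.read_csv(
        filename,
        header=0,
        on_bad_lines='skip',  # pandas >= 1.3; older versions used error_bad_lines=False
        names=['data', 'target']
    ).dropna()  # drop rows with missing values, as the question's loader does
    return df.data, df.target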

You can do something similar for the new data. Make sure the file contains a single column.

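def read_test(filename):
    # Read from a csv file with a single column. Skip bad lines
    # If the file starts with a header row, pass header=0 as well
    df = pd.read_csv(
        filename,
        on_bad_lines='skip',  # pandas >= 1.3; older versions used error_bad_lines=False
        names=['data']
    )
    return df.data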

Pipeline

Then use a pipeline to make your code more flexible. The code below is easy to follow. It does not include the scoring step shown in your question.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
import numpy as np

def main():
    # Read training file
    train_data, train_target = read_training('review.csv')

    # Prepare all sklearn steps in a single pipeline
    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer(binary=True)),
        ('tf_idf_transformer', TfidfTransformer(use_idf=False)),
        ('bernoulli_nb', BernoulliNB())
    ])

    # This trains the entire pipeline on your training data
    pipeline.fit(train_data, train_target)

    # Your pipeline is now ready to apply to new data!
    test_data = read_test('test.csv')
    prediction = pipeline.predict(test_data)

    # Write prediction to file, one predicted label per line
    np.savetxt("prediction.csv", prediction, delimiter=",", fmt="%s")

if __name__ == "__main__":
    main()
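
The scoring step that the answer leaves out could be added back roughly as follows. This is only a sketch: build_pipeline and holdout_accuracy are hypothetical helper names, the toy lists are made up, and the 60/40 split with random_state=43 is copied from the question. Because the raw text is split before the pipeline is fitted, the vocabulary is learned from the training portion only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


def build_pipeline():
    # Same three steps as the answer's pipeline
    return Pipeline([
        ('count_vectorizer', CountVectorizer(binary=True)),
        ('tf_idf_transformer', TfidfTransformer(use_idf=False)),
        ('bernoulli_nb', BernoulliNB()),
    ])


def holdout_accuracy(texts, labels, test_size=0.4, random_state=43):
    # Split the raw text first, then fit the whole pipeline on the training part
    texts_train, texts_test, labels_train, labels_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state)
    pipeline = build_pipeline()
    pipeline.fit(texts_train, labels_train)
    predicted = pipeline.predict(texts_test)
    return accuracy_score(labels_test, predicted)


# Made-up toy data standing in for review.csv (assumption, not from the question)
toy_texts = [
    "loved it", "great quality", "really great value", "loved the design",
    "works great", "awful product", "terrible quality", "waste of money",
    "awful support", "terrible, broke quickly",
]
toy_labels = ["positive"] * 5 + ["negative"] * 5

print("The accuracy score is {:.2%}".format(holdout_accuracy(toy_texts, toy_labels)))
```

With real data, the lists would come from read_training('review.csv') instead of the toy values.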

Regarding "python - How do I use this machine learning model on a new file that has no output column?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57709950/
