gpt4 book ai didi

python - 具有许多输出的文本[多级]分类

转载 作者:太空狗 更新时间:2023-10-29 18:12:41 26 4
gpt4 key购买 nike

问题陈述:

将文本文档归类到其所属的类别,并将该类别最多分为两级。

样本训练集:

Description Category    Level1  Level2
The gun shooting that happened in Vegas killed two Crime | High Crime High
Donald Trump elected as President of America Politics | High Politics High
Rian won in football qualifier Sports | Low Sports Low
Brazil won in football final Sports | High Sports High

初始尝试:

我尝试创建一个分类器模型,该模型将尝试使用随机森林方法对类别进行分类,它给了我 90% 的总分。

代码1:

import pandas as pd
#import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem

from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score

stop = stopwords.words('english')
data_file = "Training_dataset_70k"

#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()

#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace('[^\w\s]',' ').replace('\s+',' ')
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))

train_data, test_data, train_label, test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print "Overall_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_Category.txt", "w", "utf8") as out:
for inp, pred, act in zip(test_data, Output_predict, test_label):
try:
out.write("{}\t{}\t{}\n".format(inp, pred, act))
except:
continue

问题:

我想在模型中再添加两个级别,它们是 Level1 和 Level2,添加它们的原因是当我单独运行 Level1 分类时,我获得了 96% 的准确率。我坚持拆分训练和测试数据集并训练具有三个分类的模型。

是否可以创建具有三个分类的模型,还是我应该创建三个模型?如何拆分训练数据和测试数据?

编辑1: 导入字符串 导入编解码器 将 Pandas 导入为 pd 将 numpy 导入为 np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score


stop = stopwords.words('english')

data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace('[^\w\s]',' ').replace('\s+',' ')

train_data, test_data, train_label, test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
print len(train_data), len(train_label)
print train_label
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print test_data_feature
Output_predict = RF.predict(test_data_feature)
print "BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
for inp, pred, act in zip(test_data, Output_predict, test_label):
try:
out.write("{}\t{}\t{}\n".format(inp, pred, act))
except:
continue

最佳答案

scikit-learn 随机森林分类器原生支持多个输出(参见 this example)。因此,您不需要创建三个单独的模型。

来自 RandomForestClassifier.fit 的文档,fit 函数的输入是:

X : array-like or sparse matrix of shape = [n_samples, n_features]

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

因此,您需要一个大小为 N x 3 的数组 y(您的标签)作为您对 RandomForestClassifier 的输入。为了拆分你的训练和测试集,你可以这样做:

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data[['Category','Level 1','Level 2']], test_size=0.3, random_state=100)

你的 train_labeltest_label 应该是大小为 N x 3 的数组,你可以用它来拟合你的模型比较你的预测(注意:我没有在这里测试它,你可能需要做一些转置)。

关于python - 具有许多输出的文本[多级]分类,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45886616/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com