
machine-learning - Multiple input parameters during text classification - Scikit learn

Reposted. Author: 行者123. Updated: 2023-11-30 08:42:23

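I am new to machine learning and am trying to do some text classification. 'CleanDesc' holds the text sentences, and 'output' holds the corresponding output. Initially I tried using one input parameter, the text string (newMerged.CleanDesc), and one output parameter (newMerged.output):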

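from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# newMerged is assumed to be an existing pandas DataFrame
finaldata = newMerged[['id','CleanDesc','type','output']]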

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, newMerged.output)
testdata = newMerged.loc[1:200]  # .ix was removed in pandas 1.0; .loc is the label-based equivalent
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predicted = clf.predict(X_test_tfidf)

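This works fine, but the accuracy is very low. I would like to include one more parameter (newMerged.type) as an input alongside the text to try to improve it. Can I do that, and how? newMerged.type is not text. It is just a two-character string such as "HT". I tried doing it as follows, but it failed: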

finaldata = newMerged[['id','CleanDesc','type','output']]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

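# This is the failing attempt: a sparse matrix and a Series cannot be nested in a list as one feature array
clf = MultinomialNB().fit([[X_train_tfidf,newMerged.type]],
newMerged.output)
testdata = newMerged.loc[1:200]  # .ix was removed in pandas 1.0
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predicted = clf.predict([[X_test_tfidf, testdata.type]])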

Best answer

You have to use hstack from scipy to append the array to the sparse matrix.

Try this!

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
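# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())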

print(X.shape)

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
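The reason a SciPy hstack is needed is that the vectorizer returns a SciPy sparse matrix, which cannot simply be nested in a Python list next to another column. As a minimal sketch of the idea, the example below uses a small hand-written matrix as a stand-in for the TF-IDF output (the values and the names `tfidf`, `extra`, and `combined` are made up for illustration) and appends one extra feature column to it.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

# Hypothetical stand-in for a TF-IDF matrix: 4 documents, 3 terms.
tfidf = csr_matrix(np.array([
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]))

# One extra feature column, one row per document (shape 4 x 1).
extra = np.array([[1.0], [0.0], [0.0], [1.0]])

# scipy.sparse.hstack joins the blocks column-wise and keeps the result sparse;
# tocsr() converts it to a format that supports row slicing.
combined = hstack([tfidf, extra]).tocsr()

print(issparse(combined))
print(combined.shape)
```

Both blocks must have the same number of rows, since each row still represents one document.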

You need to encode the categorical variable.

cat_varia = ['s', 'ut', 'ss', 'ss']
lb = LabelBinarizer()
feature2 = lb.fit_transform(cat_varia)

appended_X = hstack((X, feature2))

import pandas as pd
pd.DataFrame(appended_X.toarray())

Output:

          0         1         2         3         4         5         6         7         8    9   10   11
0  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085  0.000000  0.384085  1.0  0.0  0.0
1  0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089  0.000000  0.281089  0.0  0.0  1.0
2  0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104  0.511849  0.267104  0.0  1.0  0.0
3  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085  0.000000  0.384085  0.0  1.0  0.0
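To connect this back to the question, here is a rough end-to-end sketch of training and predicting with the combined features. It is not the answer author's code: the toy rows, the label values, and the choice of OneHotEncoder in place of LabelBinarizer are assumptions made for illustration (OneHotEncoder is aimed at input features and, with handle_unknown="ignore", tolerates categories unseen during training). Only the column names CleanDesc, type, and output come from the question. The key point is that the test data is passed through transform, not fit_transform, so its columns line up with the training matrix.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only the column names come from the question.
train = pd.DataFrame({
    "CleanDesc": [
        "pump is leaking oil",
        "screen is cracked",
        "oil leak near pump",
        "cracked display glass",
    ],
    "type": ["HT", "LT", "HT", "LT"],
    "output": ["mechanical", "electrical", "mechanical", "electrical"],
})
test = pd.DataFrame({"CleanDesc": ["pump leaking"], "type": ["HT"]})

vectorizer = TfidfVectorizer()
# Categories not seen during fit are encoded as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit both encoders on the training data and join the blocks column-wise.
X_train = hstack([
    vectorizer.fit_transform(train["CleanDesc"]),
    encoder.fit_transform(train[["type"]]),
]).tocsr()
clf = MultinomialNB().fit(X_train, train["output"])

# Reuse the fitted encoders on the test data (transform only).
X_test = hstack([
    vectorizer.transform(test["CleanDesc"]),
    encoder.transform(test[["type"]]),
]).tocsr()
predicted = clf.predict(X_test)
print(predicted)
```

Both feature blocks are non-negative, which MultinomialNB requires. Whether the extra column actually improves accuracy depends on the data.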

Regarding machine-learning - Multiple input parameters during text classification - Scikit learn, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54357984/
