
machine-learning - Why is the following partial fit not working?

Reposted. Author: 行者123. Last updated: 2023-11-30 09:52:08

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Hello, I have the following list of comments:

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I am vectorizing these comments with tfidf as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing

I am encoding the labels with a label encoder:

le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
import pickle
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

Here I fit the model using the passive-aggressive classifier:

clf2 = PassiveAggressiveClassifier()


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I try to test partial fit with three new comments and their corresponding labels, as follows:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that I do not get the correct results after the partial fit:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

However, I get this output:

[2 2 2]

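So I would really appreciate your help: if I test the model with the same examples it was just trained on, why has the model not been updated? The desired output is: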

[1,0,2]

I would also appreciate help with tuning the hyperparameters so that I get the desired output.

Here is the complete code showing the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

But I got:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]

Best answer

Your code has several problems. I will state the obvious ones first and then the more complex ones:

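  1. You pickle clf2 before it has learned anything (that is, you pickle it as soon as it is defined, which serves no purpose). If you are only testing, that is fine. Otherwise, objects should be pickled after the fit() call or its equivalent.
  2. You call clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn. In your case this is acceptable, because your subsequent call to partial_fit() supplies the same labels. But it is still not good practice.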

    See this for more details

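    In a partial_fit() scenario, never call fit(). Always call partial_fit(), both with your starting data and with newly arriving data. But make sure that the first call to partial_fit() supplies, through the classes parameter, all the labels you want the model to learn.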

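  3. Now the last part, concerning your tfidf_vectorizer. You call fit_transform() (which is essentially fit() and transform() combined) on tfidf_vectorizer with the comments array. This means that on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments; it will only use the words it saw during the call to fit() (the words present in comments).

    The same goes for LabelEncoder and sents.

    This, again, is not preferable in an online-learning scenario. You should fit on all the available data at once. But since you are trying to use partial_fit(), we assume that you have a very large dataset that may not fit in memory at once. So you would also want to apply some kind of partial_fit to TfidfVectorizer. But TfidfVectorizer does not support partial_fit(). In fact, it is not designed for large data. So you need to change your approach. See the following questions for more details:-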
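One commonly used way around this limitation is scikit-learn's HashingVectorizer. It is not named in the answer above, so treat it as an assumption about what "change your approach" could mean. It is stateless: there is no vocabulary to fit, so every batch can be transformed on its own before being passed to partial_fit(). It also applies no IDF weighting, so the features are not identical to tf-idf. A minimal sketch using the toy data from the question; the batch layout, n_features and random_state are illustrative choices:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

# Two batches standing in for data that arrives over time (illustrative)
batches = [
    (['I am very agry', 'this is not interesting', 'I am very happy'],
     ['angry', 'indiferent', 'happy']),
    (['I love the life', 'I hate you', 'this is not important'],
     ['happy', 'angry', 'indiferent']),
]

# The full label set still has to be known before the first partial_fit()
le = LabelEncoder().fit(['angry', 'happy', 'indiferent'])
all_classes = le.transform(le.classes_)

# Stateless: no fit() step, and unseen words still map to a column
vectorizer = HashingVectorizer(n_features=2**18)

clf = PassiveAggressiveClassifier(random_state=0)
for i, (texts, sents) in enumerate(batches):
    X = vectorizer.transform(texts)
    y = le.transform(sents)
    if i == 0:
        # classes is required on the first call only
        clf.partial_fit(X, y, classes=all_classes)
    else:
        clf.partial_fit(X, y)

new_X = vectorizer.transform(batches[1][0])
pred = clf.predict(new_X)
print(new_X.shape)
print(le.inverse_transform(pred))
```

Each partial_fit() call makes a single pass over its batch, so on a real stream the predictions depend on how much data the model has seen.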

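Putting all of that aside, if you change only the tfidf part so that it is fitted on the whole data (comments and new_comments together), you will get the desired results.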
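The effect of fitting the vectorizer on both lists can be checked directly by counting the non-zero features each new comment receives. A small sketch; the names narrow and wide are made up for this illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# Vocabulary learned from `comments` only, as in the question
narrow = TfidfVectorizer(analyzer='word').fit(comments)
# Vocabulary learned from both lists, as suggested in the answer
wide = TfidfVectorizer(analyzer='word').fit(comments + new_comments)

# Number of non-zero features per transformed new comment
narrow_nnz = np.count_nonzero(narrow.transform(new_comments).toarray(), axis=1)
wide_nnz = np.count_nonzero(wide.transform(new_comments).toarray(), axis=1)

print(sorted(narrow.vocabulary_))
print(narrow_nnz)  # [0 0 3]
print(wide_nnz)    # [3 2 4]
```

With the vocabulary from comments alone, the first two new comments become all-zero vectors, so the classifier receives identical inputs for them and cannot assign them different labels.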

See the code changes below (I may have tidied it up a little and renamed vec_new_comments to new_tfidf, so please read it carefully):

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The below lines are important

# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)

# Same for `sents`, but since the labels don't change, it doesn't matter which you use; the result will be the same
# le.fit(sents)
le.fit(sents + new_sents)

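The code below is the less preferred version (the one you are using, which I discussed in point 2), but the results are fine as long as you make the changes above.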

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2] As you wanted
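To read integer outputs such as [0 2 1] and [1 0 2] as labels, it helps to recall that LabelEncoder stores its classes in sorted order and that the position of each class is its encoded value. A short standalone check:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['angry', 'indiferent', 'happy'])

# Classes are kept in sorted order; the index is the encoded value
print(le.classes_.tolist())
# ['angry', 'happy', 'indiferent']

print(le.transform(['angry', 'indiferent', 'happy']).tolist())
# [0, 2, 1]

# Map predicted integers back to the original label strings
print(le.inverse_transform([1, 0, 2]).tolist())
# ['happy', 'angry', 'indiferent']
```

This is also why the hand-written new_labels = [1,0,2] in the question matches happy, angry, indiferent.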

The correct approach, or the way partial_fit() is meant to be used:

# Declare all labels that you want the model to learn
# Using classes learnt by labelEncoder for this
# In any calls to `partial_fit()`, all labels should be from this array only

all_classes = le.transform(le.classes_)

# Notice the parameter classes here
# It needs to be present the first time
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]

# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]
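For convenience, the pieces above can be put together into one script that runs on its own. This is a sketch rather than the answer's exact code: the imports, the fresh unfitted classifier and random_state=0 are additions made here so the example is repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
sents = ['angry', 'indiferent', 'happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']
new_sents = ['happy', 'angry', 'indiferent']

# Fit the vectorizer and the label encoder on all available data
tfidf_vectorizer = TfidfVectorizer(analyzer='word').fit(comments + new_comments)
le = LabelEncoder().fit(sents + new_sents)
all_classes = le.transform(le.classes_)

# Start from an unfitted classifier and never call fit()
clf2 = PassiveAggressiveClassifier(random_state=0)

# First call: declare every class the model will ever see
tfidf = tfidf_vectorizer.transform(comments)
clf2.partial_fit(tfidf, le.transform(sents), classes=all_classes)

# Later calls: classes can be omitted
new_tfidf = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(new_tfidf, le.transform(new_sents))

pred = clf2.predict(new_tfidf)
print(pred)
print(le.inverse_transform(pred))
```

Each partial_fit() call performs one pass over the batch it is given, so with so few examples the exact predictions can depend on the random seed and on the installed scikit-learn version.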

Regarding "machine-learning - Why is the following partial fit not working?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43421889/
