
machine-learning - How do machine learning models handle unseen data and unseen labels?

Reposted. Author: 行者123. Updated: 2023-11-30 08:27:23

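I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If incoming text does not fit any label, it should be tagged as "other". In the example below, I built a text classifier that classifies text as "breakfast" or "italian". In the test set, I included a couple of texts that do not fit either of the labels used for training. This is the challenge I am facing. Ideally, I want the model to say "other" for "i like hiking" and "everyone should understand maths". How can I do this?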

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Training texts, each belonging to one of two known labels.
X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])

y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]

# The last two test texts fit neither training label.
X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

# Bag-of-words counts -> TF-IDF weights -> multinomial naive Bayes.
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])

classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
# Probability columns follow classifier.classes_: ['breakfast', 'italian'].
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)

['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]

I consider the "other" category to be noise, and I cannot model it.

Best answer

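I think Kalsi may have suggested this, but it was not clear to me. You can define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any class ("italian" and "breakfast" in your example), the sample cannot be classified, which yields the "other" "class".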

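I say "class" in quotes because "other" is not exactly a class. You probably do not want your classifier to be good at predicting "other", so a confidence threshold may be a good approach.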
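As a rough illustration of the thresholding idea, here is a minimal sketch in plain Python. The helper name `predict_with_threshold`, the fallback label "other", and the threshold value 0.7 are all assumptions made for the example, not something from the question; the probabilities are copied from the output printed in the question.

```python
def predict_with_threshold(proba, classes, threshold, fallback="other"):
    """Return the most probable class for each row, or `fallback` when
    the top probability is below `threshold`.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    labels = []
    for row in proba:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best] if row[best] >= threshold else fallback)
    return labels


# Probabilities copied from the question's output. The columns follow
# the alphabetical class order: ['breakfast', 'italian'].
classes = ["breakfast", "italian"]
proba = [
    [0.25099411, 0.74900589],
    [0.52943091, 0.47056909],
    [0.52669142, 0.47330858],
    [0.42787443, 0.57212557],
    [0.4, 0.6],
]

# 0.7 is an arbitrary threshold chosen for this example.
labels = predict_with_threshold(proba, classes, threshold=0.7)
print(labels)  # ['italian', 'other', 'other', 'other', 'other']
```

With the pipeline from the question, you would pass `classifier.predict_proba(X_test)` and `classifier.classes_` instead of the hard-coded values. Note that with such a small training set the two breakfast examples also fall below 0.7, so the threshold would likely need tuning on held-out data, and more training data would help separate real predictions from "other".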

Regarding "machine-learning - How do machine learning models handle unseen data and unseen labels?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52371951/
