python - How to add another feature (length of text) to current bag of words classification? Scikit-learn

Reposted. Author: 太空狗. Updated: 2023-10-29 17:07:37

I am using bag of words to classify text. It works well, but I am wondering how to add a feature that is not a word.

Here is my sample code.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there."])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'])
target_names = ['Class 1', 'Class 2']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
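# Assuming Python 3 and a recent scikit-learn, where predict returns
# one class index per document for this two-class problem.
for item, label in zip(X_test, predicted):
    print('%s => %s' % (item, target_names[label]))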

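Now it is clear that the texts about London tend to be much longer than the texts about New York. How would I add the length of the text as a feature? Do I have to use a separate classifier and then combine the two predictions? Is there any way to do it along with the bag of words? Some sample code would be great; I am very new to machine learning and scikit-learn.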

Best answer

As shown in the comments, this is a combination of FunctionTransformer, Pipeline, and FeatureUnion.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there."])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'])
target_names = ['Class 1', 'Class 2']


def get_text_length(x):
    # Character count of each document, shaped as a single feature column.
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
predicted

This adds the length of the text to the features used by the classifier.
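
To check what the classifier actually receives, the feature step can be run on its own. The following is a minimal sketch and not part of the original answer: `docs` is a made-up toy corpus, and `text_branch`, `length_branch`, and `features` are illustrative names. It assumes a reasonably recent scikit-learn, where FeatureUnion returns a SciPy sparse matrix whenever one of its branches does.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy documents, not the training data from the answer.
docs = np.array(["nyc is nice",
                 "london is a sprawling big city",
                 "new york was originally dutch"])


def get_text_length(x):
    # Character count of each document, shaped as a single feature column.
    return np.array([len(t) for t in x]).reshape(-1, 1)


# Bag-of-words branch: token counts followed by tf-idf weighting.
text_branch = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

# Length branch: wraps the plain function so it behaves like a transformer.
length_branch = FunctionTransformer(get_text_length, validate=False)

# FeatureUnion runs both branches on the same input and stacks the
# resulting columns side by side.
features = FeatureUnion([
    ('text', text_branch),
    ('length', length_branch),
])

text_only = text_branch.fit_transform(docs)  # tf-idf columns alone
combined = features.fit_transform(docs)      # tf-idf columns plus length

dense = combined.toarray()
length_column = dense[:, -1]

print(text_only.shape, combined.shape)
print(length_column)
```

The combined matrix has exactly one more column than the tf-idf matrix, and that last column holds the raw character counts. Those counts are much larger than tf-idf weights, so it may be worth scaling the length column before it reaches a linear model, although whether that helps depends on the data.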

Regarding "python - How to add another feature (length of text) to current bag of words classification? Scikit-learn", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39121104/
