
python - Combining features in a Sklearn Pipeline

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:50:46

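I want to use a pipeline that contains a TfidfVectorizer and an SVC. Between the two, however, I want to concatenate some features extracted from non-text data onto the output of the TfidfVectorizer.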

I have already tried creating a custom class to do this (based on the approach in this tutorial), but it does not seem to work.

Here is what I have tried so far:

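import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# CustomTransformer is defined further down; one_hot_feats holds the
# one-hot encoded category features.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('transformer', CustomTransformer(one_hot_feats)),
    ('clf', MultinomialNB()),
])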

parameters = {
    'tfidf__min_df': (5, 10, 15, 20, 25, 30),
    'tfidf__max_df': (0.8, 0.9, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': np.linspace(0.1, 1.5, 15),
    'clf__fit_prior': [True, False],
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(df["short description"], labels)

Here is the CustomTransformer:

class CustomTransformer(TransformerMixin):
    """Class that concatenates the one-hot encoded category feature with the tfidf data."""

    def __init__(self, one_hot_features):
        """Initializes an instance of our custom transformer."""
        self.one_hot_features = one_hot_features

    def fit(self, X, y=None, **kwargs):
        """Dummy fit function that does nothing in particular."""
        return self

    def transform(self, X, y=None, **kwargs):
        """Adds our external features."""
        return np.hstack((self.one_hot_features, X))

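This approach works as long as X does not change dimensions inside the custom class (possibly a restriction related to TransformerMixin). In my case, however, I will be appending additional features to my data. Should my custom class inherit from a different base class, or is there a different way to solve this?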

Best answer

You can use Sklearn's FeatureUnion to combine multiple features, and ColumnTransformer to transform specific columns:

From the documentation:

FeatureUnion

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
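As a rough illustration of the FeatureUnion behaviour quoted above, here is a minimal sketch. The sample documents and the step names word_tfidf and char_counts are made up for this example and are not from the question. It applies two text vectorizers to the same input and checks that the combined matrix is as wide as both outputs together:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy documents, only for illustration.
docs = ["printer is broken", "cannot log in", "printer out of ink"]

# Both transformers receive the same input; their outputs are
# concatenated column-wise into one feature matrix.
union = FeatureUnion([
    ("word_tfidf", TfidfVectorizer()),
    ("char_counts", CountVectorizer(analyzer="char", ngram_range=(2, 2))),
])

combined = union.fit_transform(docs)

# After fitting, each vectorizer exposes the vocabulary it learned.
fitted = dict(union.transformer_list)
n_word = len(fitted["word_tfidf"].vocabulary_)
n_char = len(fitted["char_counts"].vocabulary_)

print(combined.shape, n_word, n_char)
```

Note that FeatureUnion feeds the same X to every transformer, so it suits several views of one input. For different columns of a DataFrame, ColumnTransformer is usually the more direct fit.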

In your case, you can do this with make_column_transformer:

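from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipeline = Pipeline([
    ('transformer', make_column_transformer(
        (TfidfVectorizer(), 'text_column'),  # string selector gives the 1-D input TfidfVectorizer expects
        (OneHotEncoder(), ['categorical_column']))),
    ('clf', MultinomialNB()),
])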
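To show how this could fit together with the grid search from the question, here is a minimal sketch on a made-up DataFrame. The data, the labels and the reduced parameter grid are assumptions for illustration only. make_column_transformer names each step after its lowercased class name, so the TF-IDF parameters are reached through the transformer__tfidfvectorizer__ prefix instead of tfidf__:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names follow the answer's placeholders.
df = pd.DataFrame({
    "text_column": [
        "printer is broken", "cannot log in", "printer out of ink",
        "password reset needed", "printer jam", "login page down",
    ],
    "categorical_column": [
        "hardware", "access", "hardware", "access", "hardware", "access",
    ],
})
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("transformer", make_column_transformer(
        # A string selector passes a 1-D column to TfidfVectorizer.
        (TfidfVectorizer(), "text_column"),
        # handle_unknown="ignore" avoids errors if a CV fold lacks a category.
        (OneHotEncoder(handle_unknown="ignore"), ["categorical_column"]),
    )),
    ("clf", MultinomialNB()),
])

# Nested parameter names: <pipeline step>__<auto-generated name>__<param>.
param_grid = {
    "transformer__tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=2)
grid_search.fit(df, labels)
print(grid_search.best_params_)
```

Because the whole DataFrame is passed to fit, each cross-validation fold slices the text and categorical columns together, so the row counts always match when the features are concatenated.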

Edit:

Set remainder to 'passthrough' in make_column_transformer so that all remaining columns not specified in any transformer are passed through automatically.
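A minimal sketch of the remainder option, assuming a toy DataFrame with two numeric columns (priority and attachments are hypothetical names) that should reach the classifier unchanged:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one text column plus two already-numeric columns.
df = pd.DataFrame({
    "text_column": ["printer is broken", "cannot log in", "printer out of ink"],
    "priority": [3, 1, 2],
    "attachments": [0, 2, 1],
})

transformer = make_column_transformer(
    (TfidfVectorizer(), "text_column"),
    remainder="passthrough",  # keep every column not listed above
)
features = transformer.fit_transform(df)

vocab_size = len(transformer.named_transformers_["tfidfvectorizer"].vocabulary_)

# The result may be a sparse matrix; densify it only to inspect this toy case.
dense = features.toarray() if hasattr(features, "toarray") else features
print(features.shape, vocab_size)
print(dense[:, -2:])  # the pass-through columns are appended after the TF-IDF features
```

Passed-through columns must already be numeric (and non-negative if the classifier is MultinomialNB), since no transformer is applied to them.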

Regarding python - Combining features in a Sklearn Pipeline, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58076004/
