
python - How to get the names of features selected by feature elimination in an sklearn pipeline?

Reposted. Author: 太空狗. Updated: 2023-10-29 20:22:43

I am using recursive feature elimination in my sklearn pipeline. The pipeline looks like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

How can I get the names of the features selected by RFE? RFE should select the best 500 features, but I really need to see which features were selected.

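Edit:

I have a complex pipeline that consists of multiple pipelines and feature unions, percentile feature selection, and a final recursive feature elimination step: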

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[

            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([

                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),

                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),

                ])),
                ('percentile_feature_selection', fs_vect)
            ])),

            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([

                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),

                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ])),

                ])),
                ('percentile_feature_selection', fs),
                ('inner_scale', inner_scaler)
            ])),

        ],

        # weight components in FeatureUnion
        # n_jobs=6,

        transformer_weights={
            'vectorized_pipeline': 0.8,  # 0.8,
            'custom_pipeline': 1.0  # 1.0
        },
    )),

    ('rfe_feature_selection', f5),
    ('clf', classifier),
])

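I will try to explain the steps. The first pipeline consists of vectorizers and is called "vectorized_pipeline"; all of them have a get_feature_names function. The second pipeline contains my own features, which I also implemented with fit, transform, and get_feature_names functions. When I use @Kevin's suggestion, I get an error saying that "union" (the name of the top-level element in my pipeline) has no get_feature_names function: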

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print(np.array(feature_names)[support])

In addition, when I try to get the feature names from the individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
print(np.array(feature_names)[support])

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'

Best answer

You can access each step of a Pipeline through its named_steps attribute. Here is an example on the iris dataset; it selects only 2 features, but the solution scales.

from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
])

pipeline.fit(X, y)

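Using named_steps, you can access the attributes and methods of the transformer objects in the pipeline. RFE's support_ attribute (or its get_support() method) returns a boolean mask of the selected features: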

support = pipeline.named_steps['rfe_feature_selection'].support_
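
As a small self-contained sketch of what the fitted RFE step exposes, the snippet below rebuilds the iris pipeline and reads the mask in three forms: the boolean support_ array, the integer positions from get_support(indices=True), and ranking_. The variable names are illustrative, and separate LinearSVC instances are used for RFE and the classifier purely for clarity.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipeline = Pipeline([
    ('rfe_feature_selection',
     RFE(estimator=LinearSVC(tol=1e-4, C=0.1), n_features_to_select=2, step=1)),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(X, y)

rfe = pipeline.named_steps['rfe_feature_selection']
mask = rfe.support_                      # boolean array, one entry per input column
indices = rfe.get_support(indices=True)  # integer positions of the kept columns
ranking = rfe.ranking_                   # 1 for kept features, larger values were dropped earlier
```

The boolean mask and the index array describe the same selection; which one is more convenient depends on whether the names live in an array or a plain list.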

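support is now an array that you can use to efficiently extract the names of the selected features (columns). Make sure your feature names are in a NumPy array, not a Python list.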

import numpy as np
feature_names = np.array(iris.feature_names) # transformed list to array

feature_names[support]

array(['sepal width (cm)', 'petal width (cm)'],
dtype='|S17')
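
To illustrate why the array conversion matters, here is a short sketch with a hard-coded mask (hypothetical, chosen to mirror the output above rather than produced by fitting). Indexing a plain list with a boolean NumPy array raises a TypeError, while a NumPy array accepts the mask directly; itertools.compress is one alternative if the names must stay in a list.

```python
from itertools import compress

import numpy as np

names_list = ['sepal length (cm)', 'sepal width (cm)',
              'petal length (cm)', 'petal width (cm)']
support = np.array([False, True, False, True])  # hypothetical mask, for illustration only

# A Python list cannot be indexed with a boolean array.
try:
    names_list[support]
    list_indexing_failed = False
except TypeError:
    list_indexing_failed = True

# A NumPy array supports boolean-mask indexing.
selected_from_array = np.array(names_list)[support]

# Alternative that keeps everything as a plain list.
selected_from_list = list(compress(names_list, support))
```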

Edit

Following my comment above, here is the example with the CustomFeatures() function removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]

Regarding "python - How to get the names of features selected by feature elimination in an sklearn pipeline?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36633460/
