
python - How to combine multiple feature selection methods in Python's Scikit-Learn


I have a dataset with over 100k rows and 1,000 columns/features, plus one binary output (0 and 1). I want to select the best features/columns for my model. I was thinking of combining multiple feature selection methods in scikit-learn, but I don't know whether this is the right procedure or the right way to do it. Also, as you will see in the code below, when I use PCA it says that column f1 is the most important feature, yet at the end it says I should use column 2 (feature f2). Why does this happen, and is it good/correct/normal? See my code below, where I used dummy data:

import pandas as pd

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


df = pd.DataFrame({'f1':[1,5,3,4,5,16,3,1,0],
                   'f2':[0.1,0.5,0.3,0.4,0.5,1.6,0.3,0.1,1],
                   'f3':[12,41,53,13,53,13,65,24,21],
                   'f4':[1,6,3,4,4,18,5,2,5],
                   'f5':[10,15,32,41,51,168,27,13,2],
                   'result':[1,0,1,0,0,0,1,1,0]})

print(df)

x = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Printing the shape of my data before PCA
print(x.shape)

# Doing PCA to reduce number of features
pca = PCA()
fit = pca.fit(x)

pca_result = list(fit.explained_variance_ratio_)
print(pca_result)

#I see that 'f1', 'f2' and 'f3' are the most important values
#so now, my x is:
x = df[['f1', 'f2', 'f3']]
print(x.shape) #new shape of x

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

classifiers = [['Linear SVM', SVC(kernel = 'linear', gamma = 'scale')],
               ['Decision tree', DecisionTreeClassifier()],
               ['Random Forest', RandomForestClassifier(n_estimators = 100)]]


# Now I use 'SelectFromModel' so that I can get the optimal number of features/columns
my_acc = 0
for c in classifiers:
    clf = c[1].fit(x_train, y_train)

    model = SelectFromModel(clf, prefit=True)
    model_score = clf.score(x_test, y_test)
    column_res = model.transform(x_train).shape
    print(model_score, column_res)

    if model_score > my_acc:
        my_acc = model_score
        number_of_columns = column_res[1]
        my_cls = c[0]

# The classifier with the best accuracy and its number of columns:
print(my_cls)
print('Number of columns',number_of_columns)


# Can I call 'RFE' now? Is it the correct / good / right thing to do?
# I want to find the best column for this
my_acc = 0
for c in classifiers:
    model = c[1]
    rfe = RFE(model, n_features_to_select=number_of_columns)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)

    if acc > my_acc:
        my_acc = acc
        list_of_results = fit.support_

        final_model_name = c[0]
        final_model = c[1]

print()

print(final_model_name)  # was print(c[0]), which printed the last classifier rather than the best one
print(my_acc)
print(list_of_results)

# I got a result saying that I should use the second column, but PCA said the first column was the most important.
# Is this good / normal / correct?

Is this the right approach, or am I doing something wrong?

Best Answer

Explaining your code:

pca = PCA()
fit = pca.fit(x)

PCA() will keep all of your features. Per the docs, n_components is the number of components to keep; if n_components is not set, all components are kept.
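
As a minimal illustration of that default (the random data here is just a stand-in):

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 5)               # 100 samples, 5 features

pca_all = PCA().fit(X)                   # n_components not set: all 5 components kept
pca_two = PCA(n_components=2).fit(X)     # only the first 2 components kept

print(pca_all.n_components_)             # 5
print(pca_two.n_components_)             # 2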

The line:

pca_result = list(fit.explained_variance_ratio_)

This post explains it well: Python scikit learn pca.explained_variance_ratio_ cutoff

You should use:

fit.explained_variance_ratio_.cumsum()

because the output is the cumulative variance (as a percentage) explained by the dimensions you keep. Using PCA to measure feature importance is wrong: the explained-variance ratios belong to the principal components, which are linear combinations of all original features, not to individual columns such as f1, so they cannot tell you which original column matters most.
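
A short sketch of how the cumulative sum is typically used to pick the number of components (the 0.95 threshold and the random stand-in data are illustrative choices, not part of the original answer):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # stand-in for the question's feature matrix

cum_var = PCA().fit(X).explained_variance_ratio_.cumsum()
print(cum_var)                           # monotonically increasing, ends at 1.0

# smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.argmax(cum_var >= 0.95) + 1)
print(n_components)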

Only the part with SelectFromModel makes sense for feature selection. You could run SelectFromModel as a first step and then use PCA to reduce the dimensionality further, but if you have enough memory to train on the selected features, there is no need for the extra dimensionality reduction.
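
If you do want to chain the two steps, one way is a scikit-learn Pipeline. This is only a sketch, reusing the question's x_train/x_test split; the RandomForestClassifier selector, the 0.95 variance threshold, and the final SVC are all illustrative choices:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

pipe = Pipeline([
    # step 1: keep only the features the forest considers important
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100))),
    # step 2: optional further reduction, keeping 95% of the variance
    ('pca', PCA(n_components=0.95)),
    # step 3: the final classifier, trained on the reduced features
    ('clf', SVC(kernel='linear', gamma='scale')),
])

pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))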

Original question on Stack Overflow: https://stackoverflow.com/questions/57571312/
