
python - How to select the best features in a Dataframe using the information gain measure in scikit-learn


Closed. This question needs debugging details. It is not currently accepting answers. Closed last year.

I want to use the information gain measure (mutual information in scikit-learn) to determine the 10 best features of a Dataframe and display them in a table (sorted in ascending order by the information gain score).
In this example, features is the Dataframe containing all the training data of interest for deciding whether a restaurant will close.

# Initialization of data and labels
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

x = features.copy()   # "x" contains all training data
y = x["closed"]       # "y" contains the labels of the records in "x"

# Elimination of the class column (closed) of features
x = x.drop('closed', axis=1)

# this is x.columns, sorry for the mix of French and English
features_columns = ['moyenne_etoiles', 'ville', 'zone', 'nb_restaurants_zone',
                    'zone_categories_intersection', 'ville_categories_intersection',
                    'nb_restaurant_meme_annee', 'ecart_type_etoiles', 'tendance_etoiles',
                    'nb_avis', 'nb_avis_favorables', 'nb_avis_defavorables',
                    'ratio_avis_favorables', 'ratio_avis_defavorables',
                    'nb_avis_favorables_mention', 'nb_avis_defavorables_mention',
                    'nb_avis_favorables_elites', 'nb_avis_defavorables_elites',
                    'nb_conseils', 'nb_conseils_compliment', 'nb_conseils_elites',
                    'nb_checkin', 'moyenne_checkin', 'annual_std', 'chaine',
                    'nb_heures_ouverture_semaine', 'ouvert_samedi', 'ouvert_dimanche',
                    'ouvert_lundi', 'ouvert_vendredi', 'emporter', 'livraison',
                    'bon_pour_groupes', 'bon_pour_enfants', 'reservation', 'prix',
                    'terrasse']

# normalization
std_scale = preprocessing.StandardScaler().fit(features[features_columns])
normalized_data = std_scale.transform(features[features_columns])
labels = np.array(features['closed'])

# split the data
train_features, test_features, train_labels, test_labels = train_test_split(
    normalized_data, labels, test_size=0.2, random_state=42)

labels_true = ?
labels_pred = ?

# I don't really know how to use this function to achieve what I want
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification


# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(features,
                        columns=['Coefficient'], index=x.columns)

coeff_df.head()


What is the correct syntax to achieve this with the mutual information score?

Best Answer

adjusted_mutual_info_score compares the true labels with the label predictions from a classifier. Both label arrays must have the same shape, (n_samples,).
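For illustration, a minimal sketch of what adjusted_mutual_info_score expects; the two label arrays below are made up and stand in for ground truth and classifier output:

from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical ground-truth labels and predicted labels, same shape (n_samples,)
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

# Returns a single score measuring agreement between the two labelings
print(adjusted_mutual_info_score(labels_true, labels_pred))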
You need scikit-learn's mutual_info_classif for what you are trying to achieve. Pass the feature array and the corresponding labels to mutual_info_classif to get the estimated mutual information between each feature and the target.

import numpy as np
import pandas as pd

from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# Generate a sample data frame
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=2,
                           random_state=0, shuffle=False)
feature_columns = ['A', 'B', 'C', 'D']
features = pd.DataFrame(X, columns=feature_columns)

# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(mutual_info_classif(X, y).reshape(-1, 1),
                        columns=['Coefficient'], index=feature_columns)
Output
features.head(3)
Out[43]:
          A         B         C         D
0 -1.668532 -1.299013  0.799353 -1.559985
1 -2.972883 -1.088783  1.953804 -1.891656
2 -0.596141 -1.370070 -0.105818 -1.213570

# Displaying only the top two features. Adjust the number as required.
coeff_df.sort_values(by='Coefficient', ascending=False)[:2]

Out[44]:
   Coefficient
B     0.523911
D     0.366884
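To do the same thing on your own Dataframe and keep the 10 best features, a sketch along these lines should work; it assumes the x, y and pd from your question are already defined and that every column of x is numeric. SelectKBest is shown as the equivalent built-in way to keep only the k highest-scoring columns:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information between each feature of x and the target y
scores = mutual_info_classif(x, y, random_state=42)
coeff_df = pd.DataFrame(scores, columns=['Coefficient'], index=x.columns)

# Keep the 10 highest-scoring features, then display them in ascending order of score
top_10 = coeff_df.nlargest(10, 'Coefficient').sort_values(by='Coefficient')
print(top_10)

# Equivalent column selection with SelectKBest
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(x, y)
best_columns = x.columns[selector.get_support()]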

Regarding "python - How to select the best features in a Dataframe using the information gain measure in scikit-learn", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64343345/
