
python - How to get SHAP values for each class on a multiclass classification problem in Python

Reposted. Author: 行者123. Updated: 2023-12-05 02:30:44

I have the following data frame:

import random

import pandas as pd
import shap
import xgboost
from sklearn.model_selection import train_test_split

foo = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']})

I want to run a classification algorithm on this data to predict the 3 classes.

So I split my dataset into train and test sets and run xgboost:

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)


model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)

Now I want to get the mean SHAP values for each class, rather than the mean of the absolute SHAP values that this code generates:

shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)

[image: SHAP summary plot]

In addition, the plot labels the classes as 0, 1, 2. How do I know which of the original classes 0, 1 and 2 correspond to?

Because this code:

shap.summary_plot(shap_values, X_test,
                  class_names=['a', 'b', 'c'])

gives

[image: SHAP summary plot]

and this code:

shap.summary_plot(shap_values, X_test,
                  class_names=['b', 'c', 'a'])

gives

[image: SHAP summary plot]

So I am no longer sure about the legend. Any ideas?
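On the legend question, here is a minimal sketch of how class indices usually relate to the original labels. It assumes the classifier encodes labels in sorted order, as scikit-learn's LabelEncoder does. The names `index_to_class`, `class_to_index` and `encoded` are hypothetical and only used for illustration.

```python
# Class labels as they appear in the 'class' column of the data frame.
labels = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']

# Assumption: the distinct labels are sorted, and the position of each
# label in that sorted list is its class index.
classes = sorted(set(labels))

index_to_class = dict(enumerate(classes))
class_to_index = {c: i for i, c in index_to_class.items()}

# Integer codes that a sorted-order encoder would assign to each row.
encoded = [class_to_index[c] for c in labels]

print(index_to_class)  # {0: 'a', 1: 'b', 2: 'c'}
```

Under that assumption, passing the labels in index order (for example `class_names=model.classes_`) is the consistent choice, and it is worth checking `model.classes_` on the fitted model rather than relying on a hand-written list.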

Best answer

After some research, and with the help of this post and @Alessandro Nesti's answer, here is my solution:

import random

import numpy as np
import pandas as pd
import plotly.express as px
import shap
import xgboost
from sklearn.model_selection import train_test_split

foo = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

# Note: recent xgboost releases reject string labels in fit(); if that happens,
# encode y as integers first (for example with sklearn's LabelEncoder) and map
# the indices back to the original labels afterwards.
model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

# Assumed here: shap_values is a list with one (n_samples, n_features) array
# per class, which is what older shap versions return for multiclass models.
shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap, df):
    # Put the SHAP values in a data frame with the same columns as the features
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.reset_index(drop=True)

    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)

    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    # Mean absolute SHAP value per feature
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)

    # Sign the mean absolute value by the correlation;
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)

    return k2_f

foo_all = pd.DataFrame()

# model.classes_ lists the labels in the order of the class indices
for k, v in enumerate(model.classes_):
    class_df = get_ABS_SHAP(shap_values[k], X_test)
    class_df['class'] = v
    foo_all = pd.concat([foo_all, class_df])

px.bar(foo_all, x='SHAP', y='Variable', color='class')

Result: [image: bar chart of SHAP by variable, colored by class]
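Note that `get_ABS_SHAP` reports the mean absolute SHAP value signed by the correlation between each feature and its SHAP values, which is not quite the plain mean SHAP value per class. As a hedged alternative, the sketch below averages the signed values directly. The function `mean_shap_per_class` and the toy numbers are hypothetical. It assumes the SHAP output is either a list of per-class arrays of shape (n_samples, n_features) or a single array of shape (n_samples, n_features, n_classes), since the layout can differ between shap versions.

```python
import numpy as np


def mean_shap_per_class(shap_values, feature_names, class_names):
    """Return {class_name: {feature: mean signed SHAP value}}.

    Accepts a list of (n_samples, n_features) arrays, one per class,
    or a single (n_samples, n_features, n_classes) array.
    """
    if isinstance(shap_values, list):
        # Stack the per-class arrays along a new last axis.
        values = np.stack(shap_values, axis=-1)
    else:
        values = np.asarray(shap_values)
    if values.shape[1:] != (len(feature_names), len(class_names)):
        raise ValueError("SHAP array shape does not match feature/class names")
    # Average over samples and keep the sign (no np.abs here).
    means = values.mean(axis=0)  # shape: (n_features, n_classes)
    return {
        cls: {feat: float(means[j, k]) for j, feat in enumerate(feature_names)}
        for k, cls in enumerate(class_names)
    }


# Hypothetical SHAP values: 2 samples, 2 features, 3 classes.
toy = [
    np.array([[1.0, -2.0], [3.0, -4.0]]),   # class 'a'
    np.array([[0.5, 0.5], [-0.5, 1.5]]),    # class 'b'
    np.array([[-1.0, 0.0], [-3.0, 2.0]]),   # class 'c'
]
result = mean_shap_per_class(toy, ['var1', 'var2'], ['a', 'b', 'c'])
print(result['a'])  # {'var1': 2.0, 'var2': -3.0}
```

With a real model, the call would presumably look like `mean_shap_per_class(shap_values, list(X_test.columns), list(model.classes_))`. Positive and negative contributions cancel in a signed mean, so values near zero do not necessarily mean a feature is unimportant.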

Regarding "python - How to get SHAP values for each class on a multiclass classification problem in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71753428/
