python - RandomForest and XGB: why/how? Is there any way to solve this?

Reposted. Author: 行者123. Updated: 2023-12-04 04:14:31

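The SHAP values returned by a tree explainer's .shap_values(some_data) have different dimensions/results for XGB and Random Forest. I've tried looking into it, but I can't find a reason, a method, or an explanation in any of Slundberg's (the SHAP dude's) tutorials. So: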

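  • Is there a reason for this that I'm missing?
  • Is there some flag, not obvious or one I've overlooked, that makes XGB return SHAP values per class like the other models do?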

  • Here's some sample code!
    import xgboost.sklearn as xgb
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    import shap

    bc = load_breast_cancer()
    cancer_df = pd.DataFrame(bc['data'], columns=bc['feature_names'])
    cancer_df['target'] = bc['target']
    cancer_df = cancer_df.iloc[0:50, :]
    target = cancer_df['target']
    cancer_df.drop(['target'], inplace=True, axis=1)

    X_train, X_test, y_train, y_test = train_test_split(cancer_df, target, test_size=0.33, random_state = 42)

    xg = xgb.XGBClassifier()
    xg.fit(X_train, y_train)
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)

    xg_pred = xg.predict(X_test)
    rf_pred = rf.predict(X_test)

    rf_explainer = shap.TreeExplainer(rf, X_train)
    xg_explainer = shap.TreeExplainer(xg, X_train)

    rf_vals = rf_explainer.shap_values(X_train)
    xg_vals = xg_explainer.shap_values(X_train)

    print('Random Forest')
    print(type(rf_vals))
    print(type(rf_vals[0]))
    print(rf_vals[0].shape)
    print(rf_vals[1].shape)

    print('XGBoost')
    print(type(xg_vals))
    print(xg_vals.shape)

    Output:
    Random Forest
    <class 'list'>
    <class 'numpy.ndarray'>
    (33, 30)
    (33, 30)
    XGBoost
    <class 'numpy.ndarray'>
    (33, 30)

    Any ideas would help! Thanks!

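    Best answer

    For binary classification:

  • The SHAP values for XGBClassifier (sklearn API) are raw (log-odds) values for class 1 only (one set of values).
  • The SHAP values for RandomForestClassifier are probabilities for class 0 and class 1 (two sets of values).

  • Demo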
    import numpy as np  # needed for np.array / np.isclose below
    from xgboost import XGBClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from shap import TreeExplainer
    from scipy.special import expit

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    xgb = XGBClassifier(
        max_depth=5, n_estimators=100, eval_metric="logloss", use_label_encoder=False
    ).fit(X_train, y_train)
    xgb_exp = TreeExplainer(xgb)
    xgb_sv = np.array(xgb_exp.shap_values(X_test))
    xgb_ev = np.array(xgb_exp.expected_value)

    print("Shape of XGB SHAP values:", xgb_sv.shape)

    rf = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
    rf_exp = TreeExplainer(rf)
    rf_sv = np.array(rf_exp.shap_values(X_test))
    rf_ev = np.array(rf_exp.expected_value)

    print("Shape of RF SHAP values:", rf_sv.shape)
    Output:
    Shape of XGB SHAP values: (143, 30)
    Shape of RF SHAP values: (2, 143, 30)

    Interpretation:

    • XGBoost (143,30) dimensions:
      • 143: number of samples in test
      • 30: number of features
    • RF (2,143,30) dimensions:
      • 2: number of output classes
      • 143: number of samples
      • 30: number of features
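One way to bring both outputs into the same (n_classes, n_samples, n_features) layout is sketched below. This is only an illustration under assumptions: `to_per_class` is a hypothetical helper (not part of SHAP), it covers binary classification only, and it expects either a list of per-class arrays or a single 2-D array holding class 1 values in raw log-odds space. It relies on the fact that, for a binary model, the log-odds of class 0 is the negative of the log-odds of class 1, so the class 0 contributions are the negated class 1 contributions. A 3-D input is assumed to already have the class axis first and is returned unchanged; other layouts (for example with the class axis last) are not handled. The demo uses random numbers, not real SHAP output.

```python
import numpy as np


def to_per_class(shap_values):
    """Hypothetical helper: return an array shaped (n_classes, n_samples, n_features).

    Assumes binary classification and one of these inputs:
      - a list of per-class 2-D arrays (the RandomForest output shown above)
      - a single 2-D array of class 1 raw-margin values (the XGBoost output shown above)
      - a 3-D array that already has the class axis first
    """
    if isinstance(shap_values, list):
        # One array per class: stack them along a new leading class axis.
        return np.stack([np.asarray(v) for v in shap_values], axis=0)

    sv = np.asarray(shap_values)
    if sv.ndim == 2:
        # Class 1 only, in log-odds space: class 0 is the mirror image.
        return np.stack([-sv, sv], axis=0)
    if sv.ndim == 3:
        # Assumed to be (n_classes, n_samples, n_features) already.
        return sv
    raise ValueError(f"unsupported SHAP value layout with ndim={sv.ndim}")


# Synthetic stand-ins with the shapes from the demo above (not real SHAP output).
rng = np.random.default_rng(0)
fake_xgb_sv = rng.normal(size=(143, 30))
fake_rf_sv = [rng.normal(size=(143, 30)), rng.normal(size=(143, 30))]

print(to_per_class(fake_xgb_sv).shape)  # (2, 143, 30)
print(to_per_class(fake_rf_sv).shape)   # (2, 143, 30)
```

Note that after this conversion the XGBoost values are still in log-odds space while the RandomForest values are in probability space, so the two arrays share a shape but not a scale.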

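    To compare the xgboost SHAP values with the predicted probabilities, and hence the predicted classes, you can add the SHAP values to the base (expected) value and apply the logistic function. For the 0th data point in the test set this is: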
    xgb_pred = expit(xgb_sv[0,:].sum() + xgb_ev)
    assert np.isclose(xgb_pred, xgb.predict_proba(X_test)[0,1])
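    To compare the RF SHAP values with the predicted probability for the 0th data point (no link function is needed, since the values are already in probability space):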
    rf_pred = rf_sv[1,0,:].sum() + rf_ev[1]
    assert np.isclose(rf_pred, rf.predict_proba(X_test)[0,1])
    Note that this analysis applies to (i) the sklearn API and (ii) binary classification.
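The two checks above differ only in the space where the SHAP values add up. The following sketch shows that relationship with made-up numbers and no trained model; `reconstruct_proba`, its `raw_margin` flag, and all sample values are hypothetical and serve only as illustration. In log-odds space the sum must pass through the logistic function to become a probability, while in probability space the sum is the probability itself.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, the same mapping scipy.special.expit computes."""
    return 1.0 / (1.0 + np.exp(-z))


def reconstruct_proba(shap_row, base_value, raw_margin):
    """Hypothetical helper: rebuild the class 1 probability for one sample.

    shap_row   : 1-D array of per-feature SHAP values for that sample
    base_value : the explainer's expected value in the same space
    raw_margin : True if the values live in log-odds space (XGBoost case above),
                 False if they already live in probability space (RF case above)
    """
    total = np.sum(shap_row) + base_value
    return sigmoid(total) if raw_margin else total


# Made-up contributions for a three-feature model (illustrative only).
xgb_like_row, xgb_like_base = np.array([0.8, -0.3, 0.5]), 0.2
rf_like_row, rf_like_base = np.array([0.10, -0.05, 0.15]), 0.6

print(reconstruct_proba(xgb_like_row, xgb_like_base, raw_margin=True))
print(reconstruct_proba(rf_like_row, rf_like_base, raw_margin=False))
```

With real explainers, `shap_row` would be one row of the class 1 SHAP values and `base_value` the matching entry of `expected_value`, as in the two assertions above.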

    Regarding "python - RandomForest and XGB: why/how? Is there any way to solve this?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61004438/
