python - Interpreting feature importance values from a RandomForestClassifier

Reposted by: 行者123 · Updated: 2023-11-28 16:29:08


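I'm a beginner when it comes to machine learning, and I'm having trouble interpreting some of the results I get from my first program. Here is the setup:

I have a dataset of book reviews. Each book can be tagged with any number of qualifiers drawn from a set of about 1600. The people reviewing these books can also tag themselves with these qualifiers, to indicate that they like reading things with that tag.

The dataset has one column per qualifier. For each review, if a given qualifier was used to tag both the book and the reviewer, a value of 1 is recorded. If there is no such "match" for a given qualifier on a given review, a value of 0 is recorded.

There is also a "score" column holding an integer from 1 to 5 for each review (the "star rating" of that review). My goal is to determine which features matter most for getting a high score.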
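As a rough illustration of this layout, the sketch below builds a tiny match matrix with pandas. The qualifier names, the `reviews` list, and the `match_df` name are all made up for the example; the real dataset has about 1600 qualifier columns plus ID fields.

```python
import pandas as pd

# Hypothetical qualifiers; the real dataset has about 1600 of them.
qualifiers = ["dragons", "romance", "space"]

# Hypothetical reviews: star score, tags on the book, tags on the reviewer.
reviews = [
    {"score": 5, "book_tags": {"dragons", "space"}, "reviewer_tags": {"dragons"}},
    {"score": 2, "book_tags": {"romance"}, "reviewer_tags": {"space"}},
    {"score": 4, "book_tags": {"romance", "space"}, "reviewer_tags": {"romance", "space"}},
]

rows = []
for review in reviews:
    # A qualifier "matches" only if it tags both the book and the reviewer.
    matched = review["book_tags"] & review["reviewer_tags"]
    row = {"score": review["score"]}
    for q in qualifiers:
        row[q] = int(q in matched)
    rows.append(row)

match_df = pd.DataFrame(rows, columns=["score"] + qualifiers)
print(match_df)
```

Each row is one review, and a 1 means the qualifier appears on both the book and the reviewer.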

Here is the code I have right now ( https://gist.github.com/souldeux/99f71087c712c48e50b7 ):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def determine_feature_importance(df):
    # Determines the importance of individual features within a dataframe
    # Grab headers for all feature columns, excluding score & IDs
    features_list = df.columns.values[4:]
    print("Features List: \n", features_list)

    # Set X equal to all feature values, excluding score & ID fields
    # (iloc keeps each column's own dtype instead of one mixed object array)
    X = df.iloc[:, 4:].to_numpy()

    # Set y equal to all score values
    y = df.iloc[:, 0].to_numpy()

    # Fit a random forest with near-default parameters to determine feature importance
    print('\nCreating Random Forest Classifier...\n')
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    print('\nFitting Random Forest Classifier...\n')
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    print(feature_importance)

    # Make importances relative to the maximum importance
    print("\nMaximum feature importance is currently: ", feature_importance.max())
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    print("\nNormalized feature importance: \n", feature_importance)
    print("\nNormalized maximum feature importance: \n", feature_importance.max())
    print("\nTo do: set fi_threshold == max?")
    print("\nTesting: setting fi_threshold == 1")
    fi_threshold = 1

    # Get indices of all features over fi_threshold
    important_idx = np.where(feature_importance > fi_threshold)[0]
    print("\nRetrieved important_idx: ", important_idx)

    # Create an array of all feature names above fi_threshold
    important_features = features_list[important_idx]
    print("\n", important_features.shape[0], "Important features (>", fi_threshold, "% of max importance):\n", important_features)

    # Get sorted indices of important features (descending importance)
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    print("\nFeatures sorted by importance (DESC):\n", important_features[sorted_idx])

    # Generate plot (reversed order so the most important bar is drawn at the top)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative importance')
    plt.ylabel('Variable importance')
    plt.show()

    # Reduce X to the important features, sorted by importance (currently unused)
    X = X[:, important_idx][:, sorted_idx]


    return "Feature importance determined"

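I have successfully generated a plot, but I'm honestly not sure what it means. As I understand it, it shows me how strongly any given feature affects the score variable. But, and I realize this must be a stupid question, how do I know whether the effect is positive or negative?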

Best answer

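In short, you don't. Decision trees (the building blocks of a random forest) do not work this way. With a linear model it is easy to tell whether a feature is "positive" or "negative", because its only effect on the final result is that it gets added (with a weight). Nothing more. An ensemble of decision trees, however, can have arbitrarily complex rules for each feature, for example "if the book has a red cover and more than 100 pages, then if it contains dragons it gets a high score" but "if the book has a blue cover and more than 100 pages, then if it contains dragons it gets a low score", and so on.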

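Feature importance only tells you which features contribute to the decision, not "in which direction", because sometimes a feature pushes the result one way and sometimes the other.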
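A small sketch on synthetic data can make this concrete. The feature names (`red_cover`, `dragons`, `noise`) and the labelling rule are invented for illustration, loosely following the dragons example; the forest settings are arbitrary. Dragons raise the score for red books and lower it for blue ones, so the feature matters but has no single direction.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical binary features: red_cover, dragons, noise.
# Every combination appears equally often (8 combinations x 50 copies).
X = np.array(list(product([0, 1], repeat=3)) * 50)
red_cover, dragons = X[:, 0], X[:, 1]

# High score (1) when dragons appear in a red book or are absent from a
# blue book; low score (0) otherwise. Dragons help or hurt depending on cover.
y = (red_cover == dragons).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
linear = LogisticRegression().fit(X, y)

names = ["red_cover", "dragons", "noise"]
for name, imp, coef in zip(names, forest.feature_importances_, linear.coef_[0]):
    print(f"{name:10s} importance={imp:.3f} linear_coef={coef:+.3f}")
```

The importances from `feature_importances_` are non-negative and sum to 1, so they carry no sign. In this setup the forest should rank `dragons` well above `noise`, while the linear coefficient for `dragons` stays near zero because its positive and negative effects cancel.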

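What can you do? You can make an extreme simplification: assume you are only interested in a feature in complete isolation from the others. Then, once you know which features are important, you can count how many times each such feature occurs in each class (the scores, in your case). This gives you the distribution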

P(gets score X|has feature Y)

which will tell you, more or less, whether the feature has (after marginalization) a positive or negative effect.
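One way to estimate this distribution is a normalized cross-tabulation. This is a minimal sketch assuming a pandas DataFrame with a `score` column and one 0/1 match column; the `dragons` column and the data values are made up for illustration.

```python
import pandas as pd

# Hypothetical reviews: star score plus a 0/1 match column for one qualifier.
df = pd.DataFrame({
    "score":   [5, 4, 5, 2, 1, 3, 5, 2],
    "dragons": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Rows: feature value; columns: score; each row sums to 1.
# Row 1 estimates P(gets score X | has feature), row 0 the same without it.
cond = pd.crosstab(df["dragons"], df["score"], normalize="index")
print(cond)

# A one-number summary: mean score with the feature versus without it.
print(df.groupby("dragons")["score"].mean())
```

Comparing the two rows shows whether reviews with the feature lean toward higher or lower scores. Because the other features are ignored, treat this as a marginal tendency rather than a description of what the forest does.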

Regarding "python - Interpreting feature importance values from a RandomForestClassifier", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33837125/
