
R tidymodels/vip: how is variable importance determined?

Reposted. Author: 行者123. Last updated: 2023-12-03 08:20:58

Using tidymodels and the vip package in R, I computed variable importance. The code looks like this:

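library(tidymodels)
library(vip)

# rf_vi_fit is the fitted workflow from earlier in the analysis (not shown).
# pull_workflow_fit() is deprecated in newer workflows releases;
# extract_fit_parsnip() is its replacement.
rf_vi_fit %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  labs(title = "Random forest variable importance")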

Visually, it looks like this:

[Figure: "Random forest variable importance" plot]

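But what does variable importance actually mean here? Variable importance can be based on several metrics, such as the gain in R-squared or the Gini loss, but I am not sure which one vip uses. My other predictions show variable importance values of around 3 to 4, rather than values like 0.005 as in this model.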

I also could not find in the vip() documentation what the variable importance is based on.

Best answer

The answer to your question is spread across several sections of the vip documentation: https://cran.r-project.org/web/packages/vip/vip.pdf

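The vip() function is a wrapper around vi() that plots the variable importance scores. In the vip() documentation, the ... argument is described as "additional optional arguments to be passed on to vi()".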

The vi() function has an argument named method:

method = c("model", "firm", "permute", "shap")
Character string specifying the type of variable importance (VI) to compute. Current options are:
"model" (the default), for model-specific VI scores (see vi_model() for details).
"firm", for variance-based VI scores (see vi_firm() for details).
"permute", for permutation-based VI scores (see vi_permute() for details).
"shap", for Shapley-based VI scores.
For more details on the variance-based methods, see Greenwell et al. (2018) and Scholbeck et al. (2019).
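
Since method defaults to "model", a quick way to see the numbers behind the plot is to call vi() directly. The following is a minimal sketch, not the asker's pipeline: it assumes the randomForest and vip packages are installed and fits a small forest on the built-in mtcars data purely for illustration.

```r
library(randomForest)
library(vip)

set.seed(42)  # random forests are stochastic; fix the seed for repeatability

# Illustrative model only: predict mpg from all other mtcars columns.
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# vi() returns the scores that vip() would plot, as a tibble
# with (at least) the columns Variable and Importance.
vi_scores <- vi(rf, method = "model")
print(vi_scores)

# vip() forwards extra arguments such as method to vi() and draws the plot.
p <- vip(rf, method = "model", geom = "point")
```

Printing the tibble makes it easier to compare the magnitude of the scores across models than reading them off the plot axis.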

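Then, if you look at the documentation for vi_model(), it describes the model-specific VI scores for each model type in detail. Below is the excerpt describing the model-specific importance for randomForest models.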

Random forests typically provide two measures of variable importance.
The first measure is computed from permuting out-of-bag (OOB) data: for each tree, the prediction error on the OOB portion of the data is recorded (error rate for classification and MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees in the forest, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case). See importance for details, including additional arguments that can be passed via the ... argument.
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares. See importance for details.
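
The two measures described in this excerpt live on very different scales, which is one possible reason importance values can be around 3 to 4 in one model and around 0.005 in another. The sketch below illustrates this with randomForest::importance() on the built-in mtcars data. It is an assumption-laden example rather than a diagnosis of the asker's model, whose engine and settings are not shown.

```r
library(randomForest)

set.seed(42)

# importance = TRUE is needed so that the permutation measure is computed.
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Measure 1: permutation importance on OOB data, divided by its
# standard deviation (the default, scale = TRUE).
perm_scaled <- randomForest::importance(rf, type = 1)

# Measure 1 again, but without dividing by the standard deviation.
perm_raw <- randomForest::importance(rf, type = 1, scale = FALSE)

# Measure 2: total decrease in node impurity (residual sum of squares
# for regression), averaged over all trees.
purity <- randomForest::importance(rf, type = 2)

# Side-by-side view: same predictors, three different scales.
print(cbind(perm_scaled, perm_raw, purity))
```

Because each measure has its own units, the scores are best compared between predictors within one model and one measure, not between models or measures.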

This question, "R tidymodels/VIP variable importance determination", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/67833723/
