
r - Why does an "id variable" in tidymodels/recipes act as a predictor?

Reposted. Author: 行者123. Updated: 2023-12-05 08:31:12

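This is the same question as "Predict with step_naomit and retain ID using tidymodels". That question has an accepted answer, but the OP's last comment there points out that the "id variable" is still being used as a predictor, as can be seen by looking at model$fit$variable.importance.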

I have a dataset that contains an "id variable" which I want to keep. I thought I could achieve this through the recipe() specification.

library(tidymodels)

# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50,
             x = rnorm(50, 0, 5),
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)))

df_split <- initial_split(df, prop = 0.70)

# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>%
  update_role(label, new_role = "id variable") %>%
  update_role(y, new_role = "outcome") %>%
  update_role(x, new_role = "predictor") %>%
  update_role(f, new_role = "predictor") %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_meanimpute(all_numeric(), -all_outcomes())

train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>%
  fit(y ~ ., data = train_juiced)

# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept) label x f_b f_c
#> 1.03664140 -0.01405316 0.22357266 -1.80701531 -1.66285399

Created on 2020-01-27 by the reprex package (v0.3.0)

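So even though I did specify that label is an id variable, it is still used as a predictor. Maybe I can instead list only the terms I want in the formula and then add label separately as an id variable.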

rec <- recipe(training(df_split), y ~ x + f) %>%
  update_role(label, new_role = "id variable") %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found

Created on 2020-01-27 by the reprex package (v0.3.0)
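The error above is consistent with the recipe only knowing about the columns named in its formula: with y ~ x + f, label is never registered, so update_role() has nothing to select. A minimal sketch of one way around it, assuming you are willing to name label in the formula and then reassign its role (the toy data and the object names toy and roles are made up for illustration):

```r
library(tidymodels)

# Hypothetical toy data with the same columns as the question.
toy <- tibble(
  label = 1:6,
  x = c(0.5, -1.2, 3.1, 0.0, 2.2, -0.7),
  f = factor(c("a", "b", "c", "a", "b", "c")),
  y = factor(c("Y", "N", "Y", "N", "Y", "N"))
)

# Naming label in the formula registers it with the recipe (initially as a
# predictor); update_role() then moves it to the custom "id variable" role.
rec <- recipe(y ~ x + f + label, data = toy) %>%
  update_role(label, new_role = "id variable")

# summary() on a recipe lists each variable with its current role.
roles <- summary(rec)
roles
```

Using y ~ . and then calling update_role() on label should have the same effect when every remaining column is meant to be a predictor.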

I can try not mentioning label at all:

rec <- recipe(training(df_split), y ~ x + f) %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_meanimpute(all_numeric(), -all_outcomes())

train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>%
  fit(y ~ ., data = train_juiced)

# label is no longer a variable in the model
logit_fit[['fit']][['coefficients']]
#> (Intercept) x f_b f_c
#> -0.98950228 0.03734093 0.98945339 1.27014824

train_juiced
#> # A tibble: 35 x 4
#> x y f_b f_c
#> <dbl> <fct> <dbl> <dbl>
#> 1 -0.928 Y 1 0
#> 2 4.54 N 0 0
#> 3 -1.14 N 1 0
#> 4 -5.19 N 1 0
#> 5 -4.79 N 0 0
#> 6 -6.00 N 0 0
#> 7 3.83 N 0 1
#> 8 -8.66 Y 1 0
#> 9 -0.0849 Y 1 0
#> 10 -3.57 Y 0 1
#> # ... with 25 more rows

Created on 2020-01-27 by the reprex package (v0.3.0)

OK, the model works now, but my label is gone.
What should I do?

Best answer

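The main conceptual issue you are running into is that once you juice() a recipe, the result is just data, i.e. literally just a data frame. When you use it to fit a model, the model has no way of knowing that some of the variables had special roles.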
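To make this concrete, here is a small base R sketch of how a formula such as y ~ . is expanded against a plain data frame. The toy data and the name expanded are made up for illustration; the point is only that `.` picks up every non-outcome column, including label.

```r
# Hypothetical toy data frame with the same columns as the question.
toy <- data.frame(
  label = 1:6,
  x = c(0.5, -1.2, 3.1, 0.0, 2.2, -0.7),
  f = factor(c("a", "b", "c", "a", "b", "c")),
  y = factor(c("Y", "N", "Y", "N", "Y", "N"))
)

# `.` expands to every column that is not on the left-hand side,
# so label ends up as an ordinary model term.
expanded <- attr(terms(y ~ ., data = toy), "term.labels")
expanded
#> [1] "label" "x"     "f"
```

Nothing in a data frame carries the recipe's role information, which is why the fitted model treats label like any other predictor.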

library(tidymodels)

# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50,
             x = rnorm(50, 0, 5),
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)))

df_split <- initial_split(df, prop = 0.70)

rec <- recipe(y ~ ., training(df_split)) %>%
  update_role(label, new_role = "id variable") %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_meanimpute(all_numeric(), -all_outcomes()) %>%
  prep()

train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#> label x y f_b f_c
#> <int> <dbl> <fct> <dbl> <dbl>
#> 1 1 1.80 N 1 0
#> 2 3 1.45 N 0 0
#> 3 5 -5.00 N 0 0
#> 4 6 -4.15 N 1 0
#> 5 7 1.37 Y 0 1
#> 6 8 1.62 Y 0 1
#> 7 10 -1.77 Y 1 0
#> 8 11 -3.15 N 0 1
#> 9 12 -2.02 Y 0 1
#> 10 13 2.65 Y 0 1
#> # … with 25 more rows

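Notice that train_juiced is just a regular tibble. If you train a model on this tibble using fit(), the model will not know anything about the recipe that was used to transform the data.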
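If you do want to keep fitting on juiced data, one possible workaround (a sketch, not part of the original answer) is to pass role selectors to juice() so that only predictors and outcomes are returned. The seed, the simulated data, and the names model_data and logit_fit are assumptions for illustration, and all_nominal_predictors() assumes a reasonably recent version of recipes.

```r
library(tidymodels)

set.seed(2020)  # assumed seed, only so the sketch is reproducible

# Simulated data with the same columns as the question.
df <- tibble(
  label = 1:50,
  x = rnorm(50, 0, 5),
  f = factor(sample(c("a", "b", "c"), 50, replace = TRUE)),
  y = factor(sample(c("Y", "N"), 50, replace = TRUE))
)

rec <- recipe(y ~ ., data = df) %>%
  update_role(label, new_role = "id variable") %>%
  step_dummy(all_nominal_predictors()) %>%
  prep()

# Role selectors restrict the returned columns, so label is dropped here.
model_data <- juice(rec, all_predictors(), all_outcomes())
names(model_data)

# With label gone from the data, `y ~ .` can no longer pick it up.
logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(y ~ ., data = model_data)
```

The trade-off is that the identifier is no longer in the modelling data, so it has to be joined back from the original rows afterwards.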

The tidymodels framework does have a way to train models using the role information from a recipe. Probably the easiest way is to use workflows.

logit_spec <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm")

wf <- workflow() %>%
  add_model(logit_spec) %>%
  add_recipe(rec)

logit_fit <- fit(wf, training(df_split))

# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = formula, family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) x f_b f_c
#> 0.42331 -0.04234 -0.04991 0.64728
#>
#> Degrees of Freedom: 34 Total (i.e. Null); 31 Residual
#> Null Deviance: 45
#> Residual Deviance: 44.41 AIC: 52.41

Created on 2020-02-15 by the reprex package (v0.3.0)

No more label in the model!
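Since the original goal was to keep the identifier, here is a hedged sketch of how predictions from a fitted workflow could be lined up with label again. It is an illustration rather than part of the original answer: the seed and the names wf_fit and test_preds are made up, step_impute_mean() is used on the assumption that it is the current name for step_meanimpute() in your version of recipes, and the recipe is deliberately left unprepped because the workflow preps it during fit() (recent workflows releases, as far as I know, reject a recipe that has already been trained).

```r
library(tidymodels)

set.seed(2020)  # assumed seed, only so the sketch is reproducible

df <- tibble(
  label = 1:50,
  x = rnorm(50, 0, 5),
  f = factor(sample(c("a", "b", "c"), 50, replace = TRUE)),
  y = factor(sample(c("Y", "N"), 50, replace = TRUE))
)

df_split <- initial_split(df, prop = 0.70)

# Unprepped recipe: label keeps a non-predictor role.
rec <- recipe(y ~ ., data = training(df_split)) %>%
  update_role(label, new_role = "id variable") %>%
  step_dummy(all_nominal_predictors()) %>%
  step_impute_mean(all_numeric_predictors())

wf_fit <- workflow() %>%
  add_model(logistic_reg(mode = "classification") %>% set_engine("glm")) %>%
  add_recipe(rec) %>%
  fit(data = training(df_split))

# predict() returns one row per row of new_data, in the same order,
# so the identifier can be bound back on column-wise.
test_preds <- testing(df_split) %>%
  select(label) %>%
  bind_cols(predict(wf_fit, new_data = testing(df_split)))

test_preds
```

This keeps label out of the model while still letting each prediction be traced back to its row.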

Regarding "r - Why does an "id variable" in tidymodels/recipes act as a predictor?", the corresponding question can be found on Stack Overflow: https://stackoverflow.com/questions/59941616/
