
r - Is there a way to group rows (especially dummy variables) in the recipes package in R (or mlr3)?

Repost. Author: 行者123. Updated: 2023-12-05 02:51:05

# Packages
library(dplyr)
library(recipes)

# toy dataset, with A being multicolored
df <- tibble(name = c("A", "A", "A", "B", "C"), color = c("green", "yellow", "purple", "green", "blue"))


#> # A tibble: 5 x 2
#> name color
#> <chr> <chr>
#> 1 A green
#> 2 A yellow
#> 3 A purple
#> 4 B green
#> 5 C blue

The recipe step works well:

dummified_df <- recipe(. ~ ., data = df) %>%
  step_dummy(color, one_hot = TRUE) %>%
  prep(training = df) %>%
  juice()


#> # A tibble: 5 x 5
#> name color_blue color_green color_purple color_yellow
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0 1 0 0
#> 2 A 0 0 0 1
#> 3 A 0 0 1 0
#> 4 B 0 1 0 0
#> 5 C 1 0 0 0

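But the result I actually want is the one below, with one observation per row, because a multicolored item no longer needs multiple rows once its colors are dummy-encoded.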

summarized_dummified_df <- dummified_df %>%
  group_by(name) %>%
  summarise_all(~ifelse(max(.) > 0, 1, 0)) %>%
  ungroup()


#> # A tibble: 3 x 5
#> name color_blue color_green color_purple color_yellow
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0 1 1 1
#> 2 B 0 1 0 0
#> 3 C 1 0 0 0
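For readers on dplyr 1.0.0 or later, the same collapse can be sketched with `across()` instead of `summarise_all()`. This is only an illustration: it rebuilds the dummified table by hand so that it runs on its own, and it assumes the indicator columns all share the `color_` prefix.

```r
library(dplyr)

# Hand-built copy of the dummified table shown above (assumption: same values)
dummified_df <- tibble(
  name         = factor(c("A", "A", "A", "B", "C")),
  color_blue   = c(0, 0, 0, 0, 1),
  color_green  = c(1, 0, 0, 1, 0),
  color_purple = c(0, 0, 1, 0, 0),
  color_yellow = c(0, 1, 0, 0, 0)
)

# For 0/1 indicators, the per-group maximum is 1 if any row has the color
collapsed <- dummified_df %>%
  group_by(name) %>%
  summarise(across(starts_with("color_"), max), .groups = "drop")

print(collapsed)
```

Taking `max` is enough here because the columns only hold 0 and 1; the `ifelse()` wrapper in the original would matter only if the columns could hold other positive values.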

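Obviously, I can do it this way. But to integrate my preprocessing fully into the tidymodels ecosystem, for example with workflows, it would be better if I could group the rows that no longer need to be repeated (thanks to the dummy variables) directly inside the recipe.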

Is there a tidymodels-sanctioned way to get this result?


I also tried to do this with mlr3, without success, because I could not find a suitable PipeOp for aggregating rows.

library("mlr3")
library("mlr3pipelines")


task = TaskClassif$new("task",
  data.table::data.table(
    name = c("A", "A", "A", "B", "C"),
    color = as.factor(c("green", "yellow", "purple", "green", "blue")),
    price = as.factor(c("low", "low", "low", "high", "low"))),
  "price"
)

poe = po("encode")

poe$train(list(task))[[1]]$data()

#> price name color.blue color.green color.purple color.yellow
#> 1: low A 0 1 0 0
#> 2: low A 0 0 0 1
#> 3: low A 0 0 1 0
#> 4: high B 0 1 0 0
#> 5: low C 1 0 0 0
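Since the task above is built from a data.table, one possible workaround is to collapse the rows before the task is created rather than inside the pipeline. The sketch below uses `data.table::dcast()` for that; it is not a PipeOp, and it assumes that `price` is constant within each `name`, as it is in the toy data.

```r
library(data.table)

dt <- data.table(
  name  = c("A", "A", "A", "B", "C"),
  color = c("green", "yellow", "purple", "green", "blue"),
  price = c("low", "low", "low", "high", "low")
)

# One row per (name, price); each color becomes a 0/1 indicator column
wide <- dcast(
  dt,
  name + price ~ color,
  value.var = "color",
  fun.aggregate = function(x) as.integer(length(x) > 0),
  fill = 0L
)

print(wide)
```

The resulting table has one row per item and could then be passed on as the task's data in place of the long-format table.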

I am looking into how to create a custom step_ function or a custom PipeOp, but I still feel I am missing something, because this kind of data does not seem unusual to me.

Best answer

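Everywhere I have seen them, dummy or indicator variables are conceptually a one-to-one mapping rather than one-to-many, and I think that is why you are running into this problem. Like you, though, I have sometimes wanted to map them one-to-many with real-world data. I usually do this in a data-wrangling step before starting the model preprocessing workflow, like this: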

library(tidyverse)

# toy dataset, with A being multicolored
df <- tibble(name = c("A", "A", "A", "B", "C"), color = c("green", "yellow", "purple", "green", "blue"))

df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = "color", names_prefix = "color_", values_from = "value", values_fill = 0)
#> # A tibble: 3 x 5
#> name color_green color_yellow color_purple color_blue
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 1 1 0
#> 2 B 1 0 0 0
#> 3 C 0 0 0 1

Created on 2020-08-18 by the reprex package (v0.3.0.9001)
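One caveat with the `pivot_wider()` approach: if the same name and color pair appears more than once, the cells are no longer uniquely identified. A minimal sketch of one way to guard against that is to drop duplicate pairs first with `distinct()`. The `df_dup` data below is made up for illustration and is not from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data in which item A lists "green" twice
df_dup <- tibble(
  name  = c("A", "A", "A", "B"),
  color = c("green", "green", "purple", "green")
)

wide <- df_dup %>%
  distinct(name, color) %>%   # keep each name/color pair only once
  mutate(value = 1) %>%
  pivot_wider(
    names_from   = color,
    names_prefix = "color_",
    values_from  = value,
    values_fill  = 0
  )

print(wide)
```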

Regarding "r - Is there a way to group rows (especially dummy variables) in the recipes package in R (or mlr3)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63372731/
