
r - mlr3 PipeOps : Create branches with different data transformations and benchmark different learners within and between branches


I would like to use PipeOps to train a learner on three alternative transformations of a dataset:

  1. No transformation.
  2. Class balancing up.
  3. Class balancing down.

Then, I would like to benchmark the three trained models.

My idea was to set up the pipeline as follows:

  1. Build the pipeline: input -> impute the dataset (optional) -> branch -> split into the three branches above -> add a learner to each branch -> unbranch.
  2. Train the pipeline and hope (this is where I got it wrong) that the results would be kept for each learner in each branch.

Unfortunately, following these steps results in a single learner that appears to have "merged" everything from the different branches. I was hoping to get a list of length 3, but instead I get a list of length 1.

R code:

library(data.table)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3viz)

learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
  cp = 0,
  maxdepth = 21,
  minbucket = 12,
  minsplit = 24
)

graph =
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "null"),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor'),
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major')
  )) %>>%
  gunion(list(
    po("learner", learner, id = "learner_null"),
    po("learner", learner, id = "learner_classbalancing_down"),
    po("learner", learner, id = "learner_classbalancing_up")
  )) %>>%
  po("unbranch")

plot(graph)

tr <- mlr3::resample(tsk("iris"), graph, rsmp("holdout"))

tr$learners

Question 1: How can I get three different results instead of one?

Question 2: How can I benchmark the three results within the pipeline after unbranching?

Question 3: What if I want to add several learners within each branch? I would like some learners to be inserted with fixed hyperparameters, while for others I would like their hyperparameters to be tuned with AutoTuner within each branch. Then, I would like to benchmark them within each branch and choose the "best" one from each branch. Finally, I would like to benchmark the three best learners against each other to end up with the single best one.

Many thanks.

Best Answer

I think I have found the answer I was looking for. In brief, what I want to do is:

Create a graph pipeline with several learners. Some of the learners should be inserted with fixed hyperparameters, while for others the hyperparameters should be tuned. Then I want to benchmark them and choose the "best" one. I also want the learners to be benchmarked under different class balancing strategies, namely doing nothing, upsampling, and downsampling. The optimal settings for up-/down-sampling (e.g. the ratio) would also be determined during tuning.

Two examples follow: one that almost does what I want, and one that does exactly what I want.

Example 1: Build a pipeline that includes all learners, i.e. both the learners with fixed hyperparameters and the learners whose hyperparameters need to be tuned

As will be shown, having both kinds of learners (i.e. with fixed and with tunable hyperparameters) does not seem to be a good idea, because tuning the pipeline ends up ignoring the learners with tunable hyperparameters.

####################################################################################
# Build Machine Learning pipeline that:
# 1. Imputes missing values (optional).
# 2. Tunes and benchmarks a range of learners.
# 3. Handles imbalanced data in different ways.
# 4. Identifies optimal learner for the task at hand.

# Abbreviations
# 1. td: Tuned. Learner already tuned with optimal hyperparameters, as found empirically by Probst et al. (2019). See http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
# 2. tn: Tuner. Optimal hyperparameters for the learner to be determined within the Tuner.
# 3. raw: Raw dataset in that class imbalances were not treated in any way.
# 4. up: Data upsampling to balance class imbalances.
# 5. down: Data downsampling to balance class imbalances.

# References
# Probst et al. (2019). http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
####################################################################################

library(dplyr)  # needed below for %>%, select(), group_by(), sample_frac()
library(tibble) # needed below for rownames_to_column(), deframe()

task <- tsk('sonar')

# Indices for splitting data into training and test sets
train.idx <- task$data() %>%
  select(Class) %>%
  rownames_to_column %>%
  group_by(Class) %>%
  sample_frac(2 / 3) %>% # Stratified sample to maintain proportions between classes.
  ungroup %>%
  select(rowname) %>%
  deframe %>%
  as.numeric
test.idx <- setdiff(seq_len(task$nrow), train.idx)

# Define training and test sets in task format
task_train <- task$clone()$filter(train.idx)
task_test <- task$clone()$filter(test.idx)

# Define class balancing strategies
class_counts <- table(task_train$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] /
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio
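
# Quick sanity check of the ratios (illustrative numbers only): with hypothetical
# class counts of 60 and 40, upsample_ratio would be 60 / 40 = 1.5 and
# downsample_ratio would be 1 / 1.5 ~= 0.67. Print the actual values for this split:
class_counts
upsample_ratio
downsample_ratio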

# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor",
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major",
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# 3. No class balancing
po_raw <- po("nop", id = "raw") # Pipe operator for 'do nothing' ('nop'), i.e. don't up/down-balance the classes.
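
# Optional sanity check (a sketch assuming the standard PipeOp$train() API, which
# takes a list of inputs and returns a list containing the transformed task):
# compare class counts before and after balancing. Clones are used so that the
# PipeOps that go into the pipelines below remain untrained.
table(task_train$truth())                                    # original counts
table(po_over$clone()$train(list(task_train))[[1]]$truth())  # after upsampling
table(po_under$clone()$train(list(task_train))[[1]]$truth()) # after downsampling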

# We will be using an XGBoost learner throughout with different hyperparameter settings.

# Define XGBoost learner with the optimal hyperparameters of Probst et al.
# Learner will be added to the pipeline later on, in conjunction with and without class balancing.
xgb_td <- lrn("classif.xgboost", predict_type = 'prob')
xgb_td$param_set$values <- list(
  booster = "gbtree",
  nrounds = 2563,
  max_depth = 11,
  min_child_weight = 1.75,
  subsample = 0.873,
  eta = 0.052,
  colsample_bytree = 0.713,
  colsample_bylevel = 0.638,
  lambda = 0.101,
  alpha = 0.894
)

xgb_td_raw <- GraphLearner$new(
  po_raw %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)

xgb_tn_raw <- GraphLearner$new(
  po_raw %>>%
    po('learner', lrn("classif.xgboost",
      predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)

xgb_td_up <- GraphLearner$new(
  po_over %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)

xgb_tn_up <- GraphLearner$new(
  po_over %>>%
    po('learner', lrn("classif.xgboost",
      predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)

xgb_td_down <- GraphLearner$new(
  po_under %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)

xgb_tn_down <- GraphLearner$new(
  po_under %>>%
    po('learner', lrn("classif.xgboost",
      predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)

learners_all <- list(
  xgb_td_raw,
  xgb_tn_raw,
  xgb_td_up,
  xgb_tn_up,
  xgb_td_down,
  xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)

# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
  # po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
  po("branch", names(learners_all)) %>>%
  gunion(unname(learners_all)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.

ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])
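
# To see only the hyperparameters of the tunable learners (a sketch; the exact id
# prefixes depend on the PipeOp/GraphLearner ids chosen above):
ps_table[grepl('_tn', id), id]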

# Set hyperparameter ranges for the tunable learners
ps_xgboost <- ps_table$id %>%
  lapply(
    function(x) {
      if (grepl('_tn', x)) {
        if (grepl('.booster', x)) {
          ParamFct$new(x, levels = "gbtree")
        } else if (grepl('.nrounds', x)) {
          ParamInt$new(x, lower = 100, upper = 110)
        } else if (grepl('.max_depth', x)) {
          ParamInt$new(x, lower = 3, upper = 10)
        } else if (grepl('.min_child_weight', x)) {
          ParamDbl$new(x, lower = 0, upper = 10)
        } else if (grepl('.subsample', x)) {
          ParamDbl$new(x, lower = 0, upper = 1)
        } else if (grepl('.eta', x)) {
          ParamDbl$new(x, lower = 0.1, upper = 0.6)
        } else if (grepl('.colsample_bytree', x)) {
          ParamDbl$new(x, lower = 0.5, upper = 1)
        } else if (grepl('.gamma', x)) {
          ParamDbl$new(x, lower = 0, upper = 5)
        }
      }
    }
  )
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)

# Set parameter ranges for the class balancing strategies
ps_class_balancing <- ps_table$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)

# Define parameter set
param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
  ps_xgboost,
  ps_class_balancing
))

# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td".
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
  lapply(
    function(x) {
      aux <- names(learners_all) %>%
        sapply(
          function(y) {
            grepl(y, x)
          }
        )
      aux <- names(aux[aux])
      param_set$add_dep(x, "branch.selection",
        CondEqual$new(aux))
    }
  )
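
# Inspect the dependencies just added; each tunable hyperparameter and each
# balancing ratio should now be conditional on its corresponding branch:
param_set$deps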

# Set up tuning instance
instance <- TuningInstance$new(
  task = task_train,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr("classif.bbrier"),
  # measures = prc_micro,
  param_set,
  terminator = term("evals", n_evals = 3))
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$tune(instance)

instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values

pipe$param_set$values <- instance$result$params
pipe$train(task_train)

pred <- pipe$predict(task_test)
pred$confusion
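
# Score the hold-out predictions with the tuning measure (and, if desired, with
# additional measures such as classification error):
pred$score(msr("classif.bbrier"))
pred$score(msrs(c("classif.bbrier", "classif.ce")))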

Note that the tuner chooses to ignore tuning the tunable learners and focuses solely on the already-tuned learners. This can be confirmed by inspecting instance$result: the only things that get "tuned" for the tunable learners are the class balancing parameters, which are not actually learner hyperparameters.

Example 2: Build a pipeline that includes only the tunable learners, find the "best" learner, and then benchmark it against the learners with fixed hyperparameters in a second step

Step 1: Build the pipeline for the tunable learners

learners_all <- list(
  # xgb_td_raw,
  xgb_tn_raw,
  # xgb_td_up,
  xgb_tn_up,
  # xgb_td_down,
  xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)

# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
  # po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
  po("branch", names(learners_all)) %>>%
  gunion(unname(learners_all)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.

ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])

ps_xgboost <- ps_table$id %>%
  lapply(
    function(x) {
      if (grepl('_tn', x)) {
        if (grepl('.booster', x)) {
          ParamFct$new(x, levels = "gbtree")
        } else if (grepl('.nrounds', x)) {
          ParamInt$new(x, lower = 100, upper = 110)
        } else if (grepl('.max_depth', x)) {
          ParamInt$new(x, lower = 3, upper = 10)
        } else if (grepl('.min_child_weight', x)) {
          ParamDbl$new(x, lower = 0, upper = 10)
        } else if (grepl('.subsample', x)) {
          ParamDbl$new(x, lower = 0, upper = 1)
        } else if (grepl('.eta', x)) {
          ParamDbl$new(x, lower = 0.1, upper = 0.6)
        } else if (grepl('.colsample_bytree', x)) {
          ParamDbl$new(x, lower = 0.5, upper = 1)
        } else if (grepl('.gamma', x)) {
          ParamDbl$new(x, lower = 0, upper = 5)
        }
      }
    }
  )
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)

ps_class_balancing <- ps_table$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
  ps_xgboost,
  ps_class_balancing
))

# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_td.xgb_tn.booster" and branch "raw.xgb_td".
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
  lapply(
    function(x) {
      aux <- names(learners_all) %>%
        sapply(
          function(y) {
            grepl(y, x)
          }
        )
      aux <- names(aux[aux])
      param_set$add_dep(x, "branch.selection",
        CondEqual$new(aux))
    }
  )

# Set up tuning instance
instance <- TuningInstance$new(
  task = task_train,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr("classif.bbrier"),
  # measures = prc_micro,
  param_set,
  terminator = term("evals", n_evals = 3))
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$tune(instance)

instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values

pipe$param_set$values <- instance$result$params
pipe$train(task_train)

pred <- pipe$predict(task_test)
pred$confusion

Note that instance$result now also returns optimal values for the learner's hyperparameters, not only for the class balancing parameters.

Step 2: Benchmark the "best" tunable learner (now tuned) against the learners with fixed hyperparameters

# Define resampling and instantiate it so that the same splits are always used

resampling <- rsmp("cv", folds = 2)

set.seed(123)
resampling$instantiate(task_train)
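
# Because the resampling is instantiated, every learner in the benchmark below is
# evaluated on exactly the same folds; the same instantiated object could also be
# passed to the tuning step (see the notes at the end).
resampling$is_instantiated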

bmr <- benchmark(
  design = benchmark_grid(
    tasks = task_train,
    learners = list(pipe, xgb_td_raw, xgb_td_up, xgb_td_down),
    resamplings = resampling
  ),
  store_models = TRUE # Only needed if you want to inspect the models
)

bmr$aggregate(msr("classif.bbrier"))
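
# Rank by mean Brier score (lower is better) to identify the overall winner:
aggr <- bmr$aggregate(msr("classif.bbrier"))
aggr[order(classif.bbrier), list(learner_id, classif.bbrier)]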

A few issues to consider

  1. I should probably also have built a separate pipe for the learners with fixed hyperparameters, so that at least the class balancing parameters get tuned for them. The two pipes (tunable and fixed hyperparameters) would then be compared with benchmark() (a rough sketch follows below this list).
  2. Should I have used the same resampling strategy throughout? That is, instantiate the resampling strategy before tuning the first pipe, so that the same strategy is also used for the second pipe and for the final benchmark.
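
The first point could look roughly like the following sketch. It assumes the objects from Example 1 and Step 2 above (xgb_td_raw, xgb_td_up, xgb_td_down, upsample_ratio, downsample_ratio, and the instantiated resampling): a second pipe is built from only the fixed-hyperparameter learners, only the branch selection and the class balancing ratios are tuned, and the tuned pipe is then benchmarked against the tunable pipe.

learners_td <- list(xgb_td_raw, xgb_td_up, xgb_td_down)
names(learners_td) <- sapply(learners_td, function(x) x$id)

graph_td <- po("branch", names(learners_td)) %>>%
  gunion(unname(learners_td)) %>>%
  po("unbranch")
pipe_td <- GraphLearner$new(graph_td)
pipe_td$predict_type <- 'prob'

# Only the class balancing ratios are tunable here; the learner hyperparameters stay fixed.
ps_table_td <- as.data.table(pipe_td$param_set)
ps_ratios <- ps_table_td$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_ratios <- Filter(Negate(is.null), ps_ratios)

param_set_td <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe_td$param_set$params$branch.selection$clone())),
  ParamSet$new(ps_ratios)
))
# Dependencies on branch.selection could be added here exactly as in Example 1.

instance_td <- TuningInstance$new(
  task = task_train,
  learner = pipe_td,
  resampling = resampling, # reuse the instantiated resampling (see point 2 above)
  measures = msr("classif.bbrier"),
  param_set_td,
  terminator = term("evals", n_evals = 3))
TunerRandomSearch$new()$tune(instance_td)

pipe_td$param_set$values <- instance_td$result$params

# Benchmark the two tuned pipes against each other
bmr2 <- benchmark(benchmark_grid(
  tasks = task_train,
  learners = list(pipe, pipe_td),
  resamplings = resampling
))
bmr2$aggregate(msr("classif.bbrier"))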

Comments/validation are very welcome.

(Special thanks to missuse for the constructive comments.)

The question "r - mlr3 PipeOps: Create branches with different data transformations and benchmark different learners within and between branches" and this answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/61014457/
