
R caret: How do I apply separate PCA to different data frames before training?

Reposted. Author: 行者123. Updated: 2023-12-01 00:10:04

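I am using caret in R. My ultimate goal is to submit different data frames to separate preProcess PCA steps and then put the PCA components together in one ridge regression training. However, see the example code below, where I do not get the same results when applying PCA through preProcess within the train function versus outside/before it.

  • Why do I not get the same results?
  • How can I achieve my main goal in the best way?

    #Sample data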
    s <- c(-0.412440717220306, -0.459911376237869, -0.234769582748413, -0.332282930612564, -0.486973077058792, -0.301480442285538, -0.181094691157341, -0.240918189287186, 0.0962697193026543, -0.119731709361076, -0.389783203601837, -0.217093095183372, -0.302948802709579, -0.406619131565094, 0.247409552335739, -0.406119048595428, 0.0574243739247322, -0.301231145858765, -0.229316398501396, -0.0620433799922466)
    t <- c(0.20061232149601, 0.0536709427833557, 0.530373573303223, 0.523406386375427, 0.267315864562988, 0.413556098937988, 0.274257719516754, 0.275401413440704, 0.634453296661377, 0.145272701978683, 0.196711808443069, 0.332845687866211, 0.345706522464752, 0.444085538387299, 0.253269702196121, 0.231440827250481, -0.196317762136459, 0.49691703915596, 0.43754768371582, 0.0106721892952919)
    u <- c(-0.565160751342773, 0.377725303173065,-0.273447960615158, -0.338064402341843, -0.59904420375824, -0.780133605003357,-0.508388638496399, -0.226167500019073, -0.257708549499512, -0.349863946437836,-0.443032741546631, -0.36387038230896, -0.455201774835587, -0.137616977095604,0.130770832300186, -0.420618057250977, -0.125859051942825, -0.382272869348526, -0.355217516422272, -0.0601325333118439)
    v <- c(-0.45850995182991, -0.0105021595954895, -0.475157409906387, -0.325350821018219, -0.548444092273712, -0.562069535255432, -0.473256289958954, -0.492668628692627, -0.205974608659744, -0.266964733600616, -0.289298176765442, -0.615423858165741, -0.261823982000351, -0.472221553325653, -0.684594392776489, -0.42777806520462, -0.240604877471924, -0.589631199836731, -0.782602787017822, -0.468854814767838)
    w <- c(-0.886135756969452, -0.96577262878418,-0.755464434623718, -0.640497982501984, -0.849709093570709, -0.837802410125732, -0.659287571907043, -0.646972358226776, 0.0532735884189606, -0.646163880825043,-0.963890254497528, -0.91286826133728, -1.10484659671783, -0.596551716327667, -0.371927708387375, -0.684276521205902, -0.55376398563385, -0.969008028507233, -0.956810772418976, -0.0229262933135033)

    y <- c(9, 26, 30, 15, 25, 30, 30, 35, 35, 30, 21, 30, 9, 33, 31, 34, 29, 35, 25, 31)

    #Sample data for procedure 1 and 2
    df_test1 <- data.frame(s, t, u, v, w)
    df_test2 <- df_test1

    #PROCEDURE 1: preProcess (pca) applied WITHIN "train" function
    library(caret)
    ytrain_df_test <- c(1:nrow(df_test1)) # row indices of the observations to be split into folds
    ntrain <- length(ytrain_df_test)
    # define folds
    cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE) #, ...)
    # define training control
    train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final') #, ...)
    #adding y
    df_test1$y <- y
    # train the model
    set.seed(1)
    model1 <- caret::train(y~., data=df_test1, trControl=train_control, method= 'ridge', preProcess = 'pca')
    output1 <- list(model1, model1$pred, summary(model1$pred), cor.test(model1$pred$pred, model1$pred$obs))
    names(output1) <- c("Model", "Model_pred", "Summary", "Correlation")
    output1


    #PROCEDURE 2: preProcess (pca) applied OUTSIDE/BEFORE "train" function
    ytrain_df_test <- c(1:nrow(df_test2)) # row indices of the observations to be split into folds
    ntrain <- length(ytrain_df_test)

    df2 <- preProcess(df_test2, method="pca", thresh = 0.95)
    df_test2 <- predict(df2, df_test2)
    df_test2$y <- y
    df_test2
    # define folds
    cv_folds <- createFolds(ytrain_df_test, k = 10, list = TRUE, returnTrain = TRUE)
    # define training control
    train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
    # train the model
    set.seed(1)
    model2 <- caret::train(y~., data=df_test2, trControl=train_control, method= 'ridge') #, preProcess = 'pca')
    model2
    output2 <- list(model2, model2$pred, summary(model2$pred), cor.test(model2$pred$pred, model2$pred$obs))
    names(output2) <- c("Model", "Model_pred", "Summary", "Correlation")
    output2

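Best answer

1.
When you perform preProcess (pca) within the train function:

  • PCA is run on each training set during CV, and that training set is transformed
  • several ridge regression models (based on the defined hyperparameter search) are estimated on each of these transformed training sets
  • each corresponding test set is transformed using the PCA obtained from its training set
  • all fitted models are evaluated on the corresponding transformed test sets

Once this is finished, the final model is built with the hyperparameters that had the best average performance on the test sets:

  • PCA is applied to the whole training data, and the transformed training data is obtained
  • using the pre-selected hyperparameters, a ridge regression model is built on the transformed training data

When you perform preProcess (pca) before the train function, you cause data leakage, because you are using information from the CV test folds to estimate the PCA coordinates. This causes optimistic bias during CV and should be avoided.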
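
The difference can be illustrated without caret. The following is a minimal sketch in base R, assuming simulated data and a single hypothetical train/test split (`X`, `train_idx` and `test_idx` are made up for illustration). It is not what caret runs internally, only the same idea: a PCA fitted on the training rows alone differs from one fitted on all rows.

```r
# Sketch: PCA estimated inside one CV fold vs. on the full data (base R only).
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(NULL, c("s", "t", "u", "v", "w")))

# One hypothetical CV split: rows 1-18 for training, rows 19-20 held out.
train_idx <- 1:18
test_idx  <- 19:20

# Inside CV: centering, scaling and rotation come from the training rows only.
pca_fold <- prcomp(X[train_idx, ], center = TRUE, scale. = TRUE)
test_scores_fold <- predict(pca_fold, newdata = X[test_idx, , drop = FALSE])

# Before CV: the same quantities are estimated from all rows,
# so the held-out rows help define the coordinates they are later scored in.
pca_all <- prcomp(X, center = TRUE, scale. = TRUE)
test_scores_all <- predict(pca_all, newdata = X[test_idx, , drop = FALSE])

# The estimated column means already differ between the two fits.
pca_fold$center - pca_all$center
```

Because the centering, scaling and rotation differ, the held-out rows receive different scores in the two procedures, which is one reason the two models in the question cannot be expected to agree.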

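2.
I am not aware of built-in caret functionality that would provide this kind of juggling with several data sets.
I trust this can be achieved with mlr3pipelines. This tutorial in particular is handy.

Here is an example of how to split the iris data set into two data sets, apply scaling and PCA to each of them, combine the transformed columns and fit an rpart model. The number of PCA components retained, as well as one rpart hyperparameter, is tuned using random search:

Packages: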
    library(mlr3pipelines)
    library(visNetwork)
    library(mlr3learners)
    library(mlr3tuning)
    library(mlr3)
    library(paradox)

Create a pipe operator selector named "slct1":
    pos1 <- po("select", id = "slct1")

Tell it which columns to select:
    pos1$param_set$values$selector <- selector_name(c("Sepal.Length", "Sepal.Width"))

Tell it what to do after it has taken the features:
    pos1 %>>%
      mlr_pipeops$get("scale", id = "scale1") %>>%
      mlr_pipeops$get("pca", id = "pca1") -> pr1

Create a pipe operator selector named "slct2":
    pos2 <- po("select", id = "slct2")

Tell it which columns to select:
    pos2$param_set$values$selector <- selector_name(c("Petal.Length", "Petal.Width"))

Tell it what to do after it has taken the features:
    pos2 %>>%
      mlr_pipeops$get("scale", id = "scale2") %>>%
      mlr_pipeops$get("pca", id = "pca2") -> pr2

Combine the two outputs:
    piper <- gunion(list(pr1, pr2)) %>>%
      mlr_pipeops$get("featureunion")

And pipe them into a learner:
    graph <- piper %>>%
      mlr_pipeops$get("learner",
                      learner = mlr_learners$get("classif.rpart"))

Let's check how it looks:
    graph$plot(html = TRUE)


Now define how it should be tuned:
    glrn <- GraphLearner$new(graph)

10-fold CV:
    cv10 <- rsmp("cv", folds = 10)

Tune the number of PCA dimensions retained for each data set as well as the complexity parameter of rpart:
    ps <- ParamSet$new(list(
      ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
      ParamInt$new("pca1.rank.", lower = 1, upper = 2),
      ParamInt$new("pca2.rank.", lower = 1, upper = 2)
    ))

Define the task and the tuning:
    task <- mlr_tasks$get("iris")

    instance <- TuningInstance$new(
    task = task,
    learner = glrn,
    resampling = cv10,
    measures = msr("classif.ce"),
    param_set = ps,
    terminator = term("evals", n_evals = 20)
    )

Start the random search:
    tuner <- TunerRandomSearch$new()
    tuner$tune(instance)
    instance$result

Perhaps this can also be done with tidymodels; however, I have not tried it yet.
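
If neither framework is available, the same idea can be written out by hand. The following is a sketch in base R under several assumptions: the two feature blocks (`block_a`, `block_b`) are simulated, the number of components per block (`n_comp`) and the penalty (`lambda`) are fixed rather than tuned, and the ridge fit uses the textbook closed-form solution as a stand-in, which is not parameterized the same way as caret's 'ridge' method. The point is only that each block's PCA is fitted on the training rows of every fold before the scores are combined.

```r
# Sketch: separate PCA per feature block inside each CV fold, then ridge regression.
# Base R only; block_a, block_b, n_comp and lambda are illustrative assumptions.
set.seed(1)
n <- 40
block_a <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, paste0("a", 1:3)))
block_b <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("b", 1:4)))
y <- as.numeric(block_a %*% c(1, -1, 0.5) + block_b[, 1] + rnorm(n, sd = 0.5))

n_comp <- 2    # components kept per block (fixed here, could be tuned)
lambda <- 0.1  # ridge penalty (fixed here, could be tuned)

# Fit PCA on the training rows of one block and project both row sets.
block_scores <- function(block, train, test) {
  fit <- prcomp(block[train, , drop = FALSE],
                center = TRUE, scale. = TRUE, rank. = n_comp)
  list(train = fit$x, test = predict(fit, block[test, , drop = FALSE]))
}

folds <- sample(rep(1:5, length.out = n))  # random 5-fold assignment
pred <- numeric(n)

for (k in 1:5) {
  test  <- which(folds == k)
  train <- which(folds != k)

  sa <- block_scores(block_a, train, test)
  sb <- block_scores(block_b, train, test)
  z_train <- cbind(sa$train, sb$train)  # scores are centered on the training rows
  z_test  <- cbind(sa$test, sb$test)

  # Closed-form ridge on centered scores; the intercept is the training mean of y.
  beta <- solve(crossprod(z_train) + lambda * diag(ncol(z_train)),
                crossprod(z_train, y[train] - mean(y[train])))
  pred[test] <- mean(y[train]) + z_test %*% beta
}

cv_rmse <- sqrt(mean((pred - y)^2))
cv_rmse
```

Tuning `n_comp` and `lambda` would require wrapping this loop in an outer search, which is the bookkeeping the pipeline approach above takes care of.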

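Edit: answering the questions in the comments.

To fully grasp mlr3, I recommend reading the book as well as the tutorials for each of the accessory packages.

In the example above, the number of PCA dimensions retained for each data set was tuned jointly with the cp hyperparameter. This was defined in these lines:
    ps <- ParamSet$new(list(
      ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
      ParamInt$new("pca1.rank.", lower = 1, upper = 2),
      ParamInt$new("pca2.rank.", lower = 1, upper = 2)
    ))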

So for pca1, the algorithm can choose to retain 1 or 2 PCs (I set it that way because there are only two features in each data set).
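
As far as I understand, the pca pipe operator hands `rank.` to `stats::prcomp`, so the effect of the two allowed values can be checked directly in base R. This is a small sketch on the two sepal columns, outside any pipeline:

```r
# Sketch: what rank. = 1 vs. rank. = 2 means for a two-feature block.
x <- scale(iris[, c("Sepal.Length", "Sepal.Width")])

pc_1 <- prcomp(x, rank. = 1)  # keep only the first principal component
pc_2 <- prcomp(x, rank. = 2)  # keep both components

ncol(pc_1$x)  # 1
ncol(pc_2$x)  # 2
```

With only two input features, 2 is the largest value that makes sense, which is why the upper bound in the parameter set is 2.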

If you do not want to tune the number of dimensions in order to optimize performance, you could define the pipeop like this:
    pos1 %>>%
      mlr_pipeops$get("scale", id = "scale1") %>>%
      mlr_pipeops$get("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1

In that case you should omit it from the parameter set:
    ps <- ParamSet$new(list(
      ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1)
    ))

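As far as I know, the explained variance cannot currently be tuned, only the number of retained dimensions of the PCA transformation.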
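
If a variance threshold is what you are after (as with `thresh = 0.95` in the question), one possible workaround is to translate the threshold into a number of components yourself and use that as the fixed `rank.`. A minimal sketch in base R, assuming the four numeric iris columns as the feature block and `thresh` as the target share of variance:

```r
# Sketch: turn an explained-variance threshold into a number of components.
x <- iris[, 1:4]
fit <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first 1, 2, ... components.
explained <- cumsum(fit$sdev^2) / sum(fit$sdev^2)

thresh <- 0.95  # assumed target, playing the role of thresh in caret::preProcess
n_keep <- which(explained >= thresh)[1]  # smallest count reaching the threshold
n_keep
```

Note that computing `n_keep` on the full data lets the held-out rows influence the choice, so a strict setup would repeat this calculation on the training rows of each fold.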

To change the predict type, you can define a learner:
    learner <- mlr_pipeops$get("learner",
    learner = mlr_learners$get("classif.rpart"))

and set the predict type:
    learner$learner$predict_type <- "prob"

Then create the graph:
    graph <- piper %>>%
      learner

Get the performance for each hyperparameter combination:
    instance$archive(unnest = "params")

Get the predictions for each hyperparameter combination:
    lapply(as.list(instance$archive(unnest = "params")[,"resample_result"])$resample_result,
           function(x) x$predictions())

Get the predictions for the best hyperparameter combination:
    instance$best()$predictions()

If you would like it in the form of a data frame:
    do.call(rbind,
            lapply(instance$best()$predictions(),
                   function(x) data.frame(x$data$tab,
                                          x$data$prob)))

There are probably some helper functions that make this easier; I just have not played with it enough.

Regarding "R caret: How do I apply separate pca to different dataframes before training?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59452615/
