
r - predict.lm() with an unknown factor level in test data

Reposted. Author: 行者123. Updated: 2023-12-03 08:40:02

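I am fitting a model to factor data and making predictions. If the newdata passed to predict.lm() contains even a single factor level that is unknown to the model, the whole predict.lm() call fails and returns an error.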

Is there a good way to have predict.lm() return a prediction for the factor levels the model knows and NA for the unknown levels, instead of only an error?

Example code:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

I would like the last command to return three "real" predictions for the factor levels "A", "B" and "C", and an NA for the unknown level "D".
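To make the failure described above concrete, the following sketch wraps the same call in tryCatch() so the error can be inspected rather than aborting the session. The set.seed() call and the variable name result are additions for illustration and are not part of the original question.

```r
# Same setup as in the question (seed added only for reproducibility).
set.seed(42)
foo <- data.frame(response = rnorm(3), predictor = as.factor(c("A", "B", "C")))
model <- lm(response ~ predictor, foo)
foo.new <- data.frame(predictor = as.factor(c("A", "B", "C", "D")))

# Capture the condition instead of letting it stop the script.
result <- tryCatch(
  predict(model, newdata = foo.new),
  error = function(e) e
)

inherits(result, "error")  # TRUE: no predictions are returned at all
conditionMessage(result)   # the message names the offending factor and level
```

The point is that one unknown level in one row is enough to lose the predictions for every row.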

Best answer

Tidied and extended the function by MorgenBall. It is now also implemented in sperrorest.

Additional features

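  • drops unused factor levels rather than only setting the missing values to NA
  • issues a message to the user that factor levels have been dropped
  • checks whether test_data contains any factor variables and returns the original data.frame if none are present
  • works not only for lm and glm, but also for glmmPQL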

Note: the function shown here may change (improve) over time.

    #' @title remove_missing_levels
    #' @description Accounts for missing factor levels present only in test data
    #' but not in train data by setting values to NA
    #'
    #' @import magrittr
    #' @importFrom gdata unmatrix
    #' @importFrom purrr map
    #' @importFrom stringr str_split
    #'
    #' @param fit fitted model on training data
    #'
    #' @param test_data data to make predictions for
    #'
    #' @return data.frame with matching factor levels to fitted model
    #'
    #' @keywords internal
    #'
    #' @export
    remove_missing_levels <- function(fit, test_data) {

      # https://stackoverflow.com/a/39495480/4185785

      # drop empty factor levels in test data
      test_data %>%
        droplevels() %>%
        as.data.frame() -> test_data

      # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
      # account for it
      if (inherits(fit, "glmmPQL")) {
        # Obtain factor predictors in the model and their levels
        factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                         names(unlist(fit$contrasts))))
        # do nothing if no factors are present
        if (length(factors) == 0) {
          return(test_data)
        }

        # map() is assumed to be purrr::map (the original omits this import)
        map(fit$contrasts, function(x) names(unmatrix(x))) %>%
          unlist() -> factor_levels
        factor_levels %>% str_split(":", simplify = TRUE) %>%
          extract(, 1) -> factor_levels

        model_factors <- as.data.frame(cbind(factors, factor_levels))
      } else {
        # Obtain factor predictors in the model and their levels
        factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                         names(unlist(fit$xlevels))))
        # do nothing if no factors are present
        if (length(factors) == 0) {
          return(test_data)
        }

        factor_levels <- unname(unlist(fit$xlevels))
        model_factors <- as.data.frame(cbind(factors, factor_levels))
      }

      # Select column names in test data that are factor predictors in
      # trained model

      predictors <- names(test_data[names(test_data) %in% factors])

      # For each factor predictor in your data, if the level is not in the model,
      # set the value to NA

      # seq_along() avoids iterating over c(1, 0) when no predictor matches
      for (i in seq_along(predictors)) {
        found <- test_data[, predictors[i]] %in% model_factors[
          model_factors$factors == predictors[i], ]$factor_levels
        if (any(!found)) {
          # track which variable
          var <- predictors[i]
          # set to NA
          test_data[!found, predictors[i]] <- NA
          # drop empty factor levels in test data
          test_data %>%
            droplevels() -> test_data
          # inform the user via message()
          message(sprintf(paste0("Setting missing levels in '%s', only present",
                                 " in test data but missing in train data,",
                                 " to 'NA'."),
                          var))
        }
      }
      return(test_data)
    }

We can apply this function to the example in the question as follows:

    predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))

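While trying to improve this function, I came across the fact that SL learning methods such as lm, glm etc. need the same levels in train and test, whereas ML learning methods (svm, randomForest) fail if levels are removed. Those methods need all levels in both train and test.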
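The difference between dropping levels and keeping the full level set can be seen in base R alone. The sketch below uses hypothetical data frames train and test (not from the answer) and does not call svm or randomForest; whether a given implementation then accepts the resulting NA rows is outside what it shows.

```r
# Hypothetical train/test data, for illustration only.
train <- data.frame(predictor = factor(c("A", "B", "C")))
test  <- data.frame(predictor = factor(c("A", "D")))

# Dropping levels leaves the test factor with its own, different level set.
dropped <- droplevels(test$predictor)
levels(dropped)

# Re-leveling against the training data keeps every training level and
# turns the unseen value "D" into NA.
aligned <- factor(as.character(test$predictor),
                  levels = levels(train$predictor))
levels(aligned)
is.na(aligned)
```

Re-leveling this way keeps the factor definition identical between train and test, which is the property the level-sensitive methods are said to need.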

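A general solution is hard to achieve, because every fitted model stores its factor level component in a different way (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm-related models.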
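For the lm case only, the idea of reading the known levels from fit$xlevels can be sketched in a few lines of base R. The helper name na_unseen_levels is hypothetical and this is not the answer's function: it assumes the factor appears in the formula under its plain column name (a term like as.factor(x) would not match) and relies on predict.lm() using na.action = na.pass by default, so rows with NA predictors yield NA predictions.

```r
set.seed(1)
train <- data.frame(response = rnorm(3),
                    predictor = factor(c("A", "B", "C")))
fit <- lm(response ~ predictor, data = train)
new <- data.frame(predictor = factor(c("A", "B", "C", "D")))

# Hypothetical helper: rebuild each factor predictor on the levels stored in
# the fitted model; values outside those levels become NA.
na_unseen_levels <- function(fit, newdata) {
  for (v in intersect(names(fit$xlevels), names(newdata))) {
    newdata[[v]] <- factor(as.character(newdata[[v]]),
                           levels = fit$xlevels[[v]])
  }
  newdata
}

pred <- predict(fit, newdata = na_unseen_levels(fit, new))
pred  # three fitted values followed by NA for the unseen level "D"
```

A version for other model classes would have to look up the levels elsewhere, which is the difficulty the paragraph above describes.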

Regarding "r - predict.lm() with an unknown factor level in test data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4285214/
