r - Using predict() with RcppArmadillo/RcppEigen when a factor variable has only one level


Reposted by: 行者123. Updated: 2023-12-04 01:23:22


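I have a question about using the predict() function with the RcppArmadillo and RcppEigen packages when a factor variable has only one level. I have built an MWE below using the iris dataset.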

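I want to first estimate a linear regression model with RcppArmadillo and then use it to predict values. The data I use for estimation contain factor variables (with more than one level and no NA). The prediction I want to make is unusual in one respect: I want to predict values using the same factor level for all observations (a level that was used in the estimation). In the example below, this means predicting Sepal.Length as if all observations came from the "versicolor" species.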

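This works well when I estimate the model with the lm() function, but not with RcppArmadillo::fastLm() or RcppEigen::fastLm(). I get the following error: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels. The same error occurs if one of the factor levels is missing. I understand why at least two levels are needed for estimation, but I don't understand why having only one level is a problem for prediction once the model has been properly estimated.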

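The obvious solution would be to use lm() instead of fastLm(), but unfortunately that is not possible because my data are large. After some trial and error, I found this dirty workaround: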

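I stack two versions of the data: the first is the original data (with all factor levels), the second is the modified data (with the same factor level for all observations);
I predict values for this stacked dataset (this works because all factor levels are present in it);
I keep only the modified subset of the data.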

Does anyone have a better solution than this, or at least an explanation of why this error occurs?

library(data.table)

# Loading iris data
iris <- as.data.table(iris)

# Estimating the model
model <-
  RcppArmadillo::fastLm(Sepal.Length ~ 
                 factor(Species)
               + Sepal.Width 
               + Petal.Length 
               + Petal.Width,
               data=iris)

summary(model)

#### 
#### Here is the error I don't understand
#### 

# This is the standard use of the predict function
iris2 <- copy(iris)
iris2[, predict := predict(model, iris2)]

# This is the way I want to use the predict function
# This fails: factor(Species) now has a single level, so predict() raises the contrasts error
iris2 <- copy(iris)
iris2[, Species := "versicolor"]
iris2[, predict2 := predict(model, iris2)]

#### 
#### This is a dirty work-around
#### 

# Creating a modified dataframe
iris3 <- copy(iris)
iris3[, `:=`(Species = "versicolor",
             data = "Modified data")]

# copying the original dataframe
iris4 <- copy(iris)
iris4[, data := "Original data"]

# Stacking the original data and the modified data
iris5 <- rbind(iris3, iris4)
iris5[, predict := predict(model, iris5)]

# Keeping only the modified data
iris_final <- iris5[data == "Modified data"]

Best answer

This is not a solution, but an explanation of why it happens.

If we inspect the source of RcppArmadillo:::predict.fastLm(), we see that it constructs the design matrix for the prediction points via

x <- model.matrix(object$formula, newdata)
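
The failure can be reproduced without fastLm() at all. The sketch below uses only base R and assumes the same formula as in the question; it calls model.matrix() directly on data where Species has been overwritten with a single value and captures the resulting error message (the name msg is only illustrative):

```r
# Same formula as in the question; iris ships with base R (datasets package).
f <- Sepal.Length ~ factor(Species) + Sepal.Width + Petal.Length + Petal.Width

newdata <- iris
newdata$Species <- "versicolor"   # every row now has the same species

# model.matrix() re-derives the factor levels from newdata alone,
# finds only one, and cannot set up contrasts for it.
msg <- tryCatch(
  {
    model.matrix(f, newdata)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Nothing from the fitted model enters this call apart from the formula, so the levels seen at estimation time are not available to it.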

On the other hand, if we inspect the source of stats::predict.lm(), we find

tt <- terms(object)
## Some source omitted here
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses")))  .checkMFClasses(cl, m)
X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)
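
To make the role of xlev and contrasts.arg concrete, here is a minimal sketch that repeats these steps by hand for a small lm() fit on iris. The names fit, newdata, mf, X and manual are only illustrative, and the formula uses Species directly rather than factor(Species):

```r
fit <- lm(Sepal.Length ~ Species + Sepal.Width, data = iris)

newdata <- iris
newdata$Species <- "versicolor"   # a single species for every row

Terms <- delete.response(terms(fit))

# xlev restores the full set of levels recorded at fit time
mf <- model.frame(Terms, newdata, xlev = fit$xlevels)
print(levels(mf$Species))

# contrasts.arg reuses the contrasts recorded at fit time
X <- model.matrix(Terms, mf, contrasts.arg = fit$contrasts)
print(colnames(X))

# The hand-built design matrix reproduces what predict.lm() returns
manual <- drop(X %*% coef(fit))
print(all.equal(unname(manual), unname(predict(fit, newdata))))
```

Because the levels come from fit$xlevels rather than from newdata, the design matrix keeps one column per non-reference level even though only one species appears in the rows.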

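Comparing the two shows that lm() stores information about the predictors' factor levels and contrasts in its result, whereas fastLm() reconstructs that information from newdata inside the predict() call: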

names(model)
# [1] "coefficients"  "stderr"        "df.residual"   "fitted.values"
# [5] "residuals"     "call"          "intercept"     "formula"      
names(lm_mod) ## Constructed with `lm()` call with same formula
#  [1] "coefficients"  "residuals"     "effects"       "rank"         
#  [5] "fitted.values" "assign"        "qr"            "df.residual"  
#  [9] "contrasts"     "xlevels"       "call"          "terms"        
# [13] "model"

Note that the "xlevels" and "contrasts" elements of the lm object are absent from the fastLm object. A more important point, though, can be seen in help("fastLm"):

    Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.
  
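Dirk can correct me if I'm wrong, but I don't think the purpose of fastLm() is to provide a rich OLS implementation covering everything stats::lm() does; I think it is more illustrative.
If your issue is big data, and that is why you don't want to use stats::lm(), may I suggest something like biglm::biglm()? (See, for example, here.) If you are really set on using RcppArmadillo::fastLm(), you can use a smaller version of your workaround: rather than duplicating the entire dataset, just add one row to the prediction set for each unused factor level.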
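
The smaller workaround could look roughly like the sketch below. The helper pad_missing_levels() is hypothetical (it is not part of RcppArmadillo or base R), and the fastLm() call is guarded so the snippet still runs when the package is not installed:

```r
# Hypothetical helper: append one template row per factor level that is
# missing from newdata, so that model.matrix() sees every level.
pad_missing_levels <- function(newdata, var, all_levels) {
  missing_levels <- setdiff(all_levels, unique(as.character(newdata[[var]])))
  if (length(missing_levels) == 0L) return(newdata)
  # Copy the first row once per missing level and overwrite the factor value
  padding <- newdata[rep(1L, length(missing_levels)), , drop = FALSE]
  padding[[var]] <- missing_levels
  rbind(newdata, padding)
}

newdata <- as.data.frame(iris)
newdata$Species <- "versicolor"
n <- nrow(newdata)

# Only two extra rows are needed here (setosa and virginica)
padded <- pad_missing_levels(newdata, "Species", levels(iris$Species))

if (requireNamespace("RcppArmadillo", quietly = TRUE)) {
  model <- RcppArmadillo::fastLm(
    Sepal.Length ~ factor(Species) + Sepal.Width + Petal.Length + Petal.Width,
    data = iris
  )
  # Predict on the padded data, then drop the predictions for the padding rows
  pred <- predict(model, padded)[seq_len(n)]
}
```

The padding rows are appended at the end, so keeping the first n predictions recovers exactly the rows of interest. This relies on the levels sorting in the same order as at estimation time, which holds here because factor() orders levels alphabetically in both cases.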

 
  
Regarding r - using predict() with RcppArmadillo/RcppEigen when a factor variable has only one level, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62282727/
