c++ - 计算 LogicalMatrix R/C++/Rcpp 的全真行的最快方法-6ren

c++ - 计算 LogicalMatrix R/C++/Rcpp 的全真行的最快方法

转载作者：太空狗更新时间：2023-10-29 23:50:16

我需要计算 LogicalMatrix 中全部为 `TRUE` 的行数。

因为我需要能够相对规律地执行此操作 1 - 25 亿次速度实际上非常重要:

我目前最好的:

我想出的最有效/最快的单进程方法是使用多少个 Rcpp 函数 (hm2)。

我有限的分析能力表明，绝大多数时间都花在了 if(r_tll == xcolls){... 上。我似乎想不出更快的不同算法(我已经尝试在发现 FALSE 时立即跳出循环，但速度要慢得多)。

可以假设的细节:

我可以假设:

矩阵的行数始终少于 1000 万行。
来自上游的所有输出矩阵将具有相同数量的列(对于给定的 session /进程/线程)。
每个矩阵永远不会超过 2326 列。

最小的例子:

m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
head(m)
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
#> [1,] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
#> [2,] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
#> [4,]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
#> [5,]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
#> [6,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

  // [[Rcpp::export]]
int hm(const LogicalMatrix& x){
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}

我不明白为什么，但在我的机器上，如果我烘烤 cols 的数量，它会更快(如果有人能解释为什么会这样，那就太好了):

// [[Rcpp::export]]
int hm2(const LogicalMatrix& x){
  const int xrows = x.nrow();
  // const int xcols = x.ncol();
  int n_all_true = 0;

  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < 10; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == 10){
      n_all_true += 1;
    }
  }
  return n_all_true;
}

时间:

microbenchmark(hm(m), hm2(m), times = 1000)
#>  Unit: microseconds
#>   expr     min       lq     mean  median       uq      max neval
#>  hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711  1000
#> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280  683.221  1000

最佳答案

使用 OpenMP(我现在看到的是针对请求单线程解决方案的问题)仍然可以快 30%，并且代码更改最少，至少在我的 4 核 Xeon 上是这样。我觉得合乎逻辑的 AND 减少可能会做得更好，但改天再说:

library(Rcpp)
library(microbenchmark)

m_rows <- 10L
m_cols <- 50000L
rebuild = FALSE

cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)

hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}

cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)

macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < ',NCOL,'; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == ',NCOL,'){
  n_all_true++;
  }
  }
  return n_all_true;
  }')
}

macroExpand_omp <- function(NCOL) {
  paste0('int hm_npjc_omp(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;

  #pragma omp parallel for reduction(+:n_all_true)
  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < ',NCOL,'; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == ',NCOL,'){
  n_all_true++;
  }
  }
  return n_all_true;
  }')
}

cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows),  plugins = "openmp", rebuild = rebuild)

cppFunction('int hm_omp(const LogicalMatrix& x) {
const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}',  plugins = "openmp", rebuild = rebuild)

# using != as inner loop control - no difference, using pre-increment in n_all_true, no diff, static vs dynamic OpenMP, attempted to direct clang and gcc to unroll loops: didn't seem to work

set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
                     hm_omp(m), hm_npjc_omp(m),
                     times = 1000))

我使用的是 GCC 4.9。 Clang 3.7 类似的结果。给予:单位:微秒 expr min lq mean 中位数 uq max neval hm(m) 614.074 640.9840 643.24836 641.462 642.9920 976.694 1000 hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060 1000 hm_jmu(m) 591.179 612.3590 625.84484 612.881 613.8825 6874.428 1000 hm_npjc(m) 62.958 63.8965 64.89338 64.346 65.0550 144.487 1000 hm_omp(m) 91.892 92.6050 165.21507 93.758 98.8230 10026.583 1000 hm_npjc_omp(m) 43.129 43.6820 129.15842 44.458 47.0860 17636.875 1000

OpenMP 的神奇之处在于在编译和链接时包含 -fopenmp(由 Rcpp、plugin="openmp" 处理)，以及 #pragma omp parallel for reduction(+:n_all_true) 计划(静态)在这种情况下，外层循环是并行化的，结果是求和，所以缩减语句告诉编译器分解问题，将每个部分的求和缩减为一个求和。 schedule(static) 描述编译器和/或运行时将如何在线程之间分配循环。在这种情况下，内循环和外循环的宽度都是已知的，所以 static 是首选；例如，如果内部循环大小变化很大，那么 dynamic 可能会更好地平衡线程之间的工作。

可以明确告诉 OpenMP 您希望每个线程执行多少次循环迭代，但通常最好让编译器决定。

另一方面，我努力使用编译器标志，例如 -funroll-loops 来替换内部循环宽度的丑陋但快速的硬编码(这不是一个通用的解决方案到问题)。我测试了这些都无济于事:请参阅 https://github.com/jackwasey/optimization-comparison

关于c++ - 计算 LogicalMatrix R/C++/Rcpp 的全真行的最快方法，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/32810274/

文章推荐： C++ 泛型函数的灵活性

文章推荐： c# - 如何在 MVC6 中注册路由约束

文章推荐： c++ - #include 指令的正确格式

太空狗

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城