c++ - 如何写出可以和Eigen抗衡的matrix矩阵乘积？-6ren

c++ - 如何写出可以和Eigen抗衡的matrix矩阵乘积？

转载作者：可可西里更新时间：2023-11-01 15:52:00

下面是 C++ 实现比较 Eigen 和 For Loop 执行矩阵-矩阵乘积所花费的时间。 For 循环已经过优化以最大限度地减少缓存未命中。 for 循环最初比 Eigen 快，但最终变得更慢(对于 500 x 500 矩阵高达 2 倍)。我还应该怎么做才能与 Eigen 竞争？阻塞是更好的 Eigen 性能的原因吗？如果是这样，我应该如何为 for 循环添加阻塞？

#include<iostream>
#include<Eigen/Dense>
#include<ctime>

int main(int argc, char* argv[]) {
    srand(time(NULL));
    //  Input the size of the matrix from the user
    int N   =   atoi(argv[1]);

    int M   =   N*N;

    //  The matrices stored as row-wise vectors
    double a[M];
    double b[M];
    double c[M];

    //  Initializing Eigen Matrices
    Eigen::MatrixXd a_E =   Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd b_E =   Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd c_E(N,N);

    double CPS  =   CLOCKS_PER_SEC;

    clock_t start, end;


    //  Matrix vector product by Eigen
    start   =   clock();
    c_E     =   a_E*b_E;
    end     =   clock();

    std::cout << "\nTime taken by Eigen is: " << (end-start)/CPS << "\n";

    //  Initializing For loop vectors
    int count   =   0;
    for (int j=0; j<N; ++j) {
        for (int k=0; k<N; ++k) {
            a[count]    =   a_E(j,k);
            b[count]    =   b_E(j,k);
            ++count;
        }
    }

    //  Matrix vector product by For loop
    start   =   clock();
    count   =   0;
    int count1, count2;
    for (int j=0; j<N; ++j) {
        count1  =   j*N;
        for (int k=0; k<N; ++k) {
            c[count]    =   a[count1]*b[k];
            ++count;
        }
    }

    for (int j=0; j<N; ++j) {
        count2  =   N;
        for (int l=1; l<N; ++l) {
            count   =   j*N;
            count1  =   count+l;
            for (int k=0; k<N; ++k) {
                c[count]+=a[count1]*b[count2];
                ++count;
                ++count2;
            }
        }
    }
    end     =   clock();

    std::cout << "\nTime taken by for-loop is: " << (end-start)/CPS << "\n";
}

最佳答案

无需费解如何实现矩阵-矩阵乘积的高性能实现。事实上，我们需要更多的人了解它，以应对 future 高性能计算的挑战。为了进入这个主题阅读BLIS: A Framework for Rapidly Instantiating BLAS Functionality是一个很好的起点。

所以为了揭秘和回答问题(How to write a matrix matrix product that can competited with Eigen)我把ggael贴的代码加长到总共400行。我刚刚在 AVX 机器(Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz)上对其进行了测试。这里有一些结果:

g++-5.3 -O3 -DNDEBUG -std=c++11 -mavx -m64 -I ../eigen.3.2.8/ gemm.cc -lrt

lehn@heim:~/work/test_eigen$ ./a.out 500
Time taken by Eigen is: 0.0190425
Time taken by for-loop is: 0.0121688

lehn@heim:~/work/test_eigen$ ./a.out 1000
Time taken by Eigen is: 0.147991
Time taken by for-loop is: 0.0959097

lehn@heim:~/work/test_eigen$ ./a.out 1500
Time taken by Eigen is: 0.492858
Time taken by for-loop is: 0.322442

lehn@heim:~/work/test_eigen$ ./a.out 5000
Time taken by Eigen is: 18.3666
Time taken by for-loop is: 12.1023

如果你有 FMA，你可以编译

g++-5.3 -O3 -DNDEBUG -std=c++11 -mfma -m64 -I ../eigen.3.2.8/ -DHAVE_FMA gemm.cc -lrt

如果你还想用 openMP 多线程也用 -fopenmp 编译

此处是基于 BLIS 论文思想的完整代码。它是独立的，除了它需要完整的 Eigen 源文件，正如 ggael 已经指出的那样:

#include<iostream>
#include<Eigen/Dense>
#include<bench/BenchTimer.h>
#if defined(_OPENMP)
#include <omp.h>
#endif
//-- malloc with alignment --------------------------------------------------------
void *
malloc_(std::size_t alignment, std::size_t size)
{
    alignment = std::max(alignment, alignof(void *));
    size     += alignment;

    void *ptr  = std::malloc(size);
    void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
    void **vp  = (void**) ptr2 - 1;
    *vp        = ptr;
    return ptr2;
}

void
free_(void *ptr)
{
    std::free(*((void**)ptr-1));
}

//-- Config --------------------------------------------------------------------

// SIMD-Register width in bits
// SSE:         128
// AVX/FMA:     256
// AVX-512:     512
#ifndef SIMD_REGISTER_WIDTH
#define SIMD_REGISTER_WIDTH 256
#endif

#ifdef HAVE_FMA

#   ifndef BS_D_MR
#   define BS_D_MR 4
#   endif

#   ifndef BS_D_NR
#   define BS_D_NR 12
#   endif

#   ifndef BS_D_MC
#   define BS_D_MC 256
#   endif

#   ifndef BS_D_KC
#   define BS_D_KC 512
#   endif

#   ifndef BS_D_NC
#   define BS_D_NC 4092
#   endif

#endif



#ifndef BS_D_MR
#define BS_D_MR 4
#endif

#ifndef BS_D_NR
#define BS_D_NR 8
#endif

#ifndef BS_D_MC
#define BS_D_MC 256
#endif

#ifndef BS_D_KC
#define BS_D_KC 256
#endif

#ifndef BS_D_NC
#define BS_D_NC 4096
#endif

template <typename T>
struct BlockSize
{
    static constexpr int MC = 64;
    static constexpr int KC = 64;
    static constexpr int NC = 256;
    static constexpr int MR = 8;
    static constexpr int NR = 8;

    static constexpr int rwidth = 0;
    static constexpr int align  = alignof(T);
    static constexpr int vlen   = 0;

    static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size.");
    static_assert(MC % MR == 0, "MC must be a multiple of MR.");
    static_assert(NC % NR == 0, "NC must be a multiple of NR.");
};


template <>
struct BlockSize<double>
{
    static constexpr int MC     = BS_D_MC;
    static constexpr int KC     = BS_D_KC;
    static constexpr int NC     = BS_D_NC;
    static constexpr int MR     = BS_D_MR;
    static constexpr int NR     = BS_D_NR;

    static constexpr int rwidth = SIMD_REGISTER_WIDTH;
    static constexpr int align  = rwidth / 8;
    static constexpr int vlen   = rwidth / (8*sizeof(double));

    static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size.");
    static_assert(MC % MR == 0, "MC must be a multiple of MR.");
    static_assert(NC % NR == 0, "NC must be a multiple of NR.");
    static_assert(rwidth % sizeof(double) == 0, "SIMD register width not sane.");
};

//-- aux routines --------------------------------------------------------------
template <typename Index, typename Alpha, typename TX, typename TY>
void
geaxpy(Index m, Index n,
       const Alpha &alpha,
       const TX *X, Index incRowX, Index incColX,
       TY       *Y, Index incRowY, Index incColY)
{
    for (Index j=0; j<n; ++j) {
        for (Index i=0; i<m; ++i) {
            Y[i*incRowY+j*incColY] += alpha*X[i*incRowX+j*incColX];
        }
    }
}

template <typename Index, typename Alpha, typename TX>
void
gescal(Index m, Index n,
       const Alpha &alpha,
       TX *X, Index incRowX, Index incColX)
{
    if (alpha!=Alpha(0)) {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] *= alpha;
            }
        }
    } else {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] = Alpha(0);
            }
        }
    }
}


//-- Micro Kernel --------------------------------------------------------------
template <typename Index, typename T>
typename std::enable_if<BlockSize<T>::vlen != 0,
         void>::type
ugemm(Index kc, T alpha, const T *A, const T *B, T beta,
      T *C, Index incRowC, Index incColC)
{
    typedef T vx __attribute__((vector_size (BlockSize<T>::rwidth/8)));

    static constexpr Index vlen = BlockSize<T>::vlen;
    static constexpr Index MR   = BlockSize<T>::MR;
    static constexpr Index NR   = BlockSize<T>::NR/vlen;

    A = (const T*) __builtin_assume_aligned (A, BlockSize<T>::align);
    B = (const T*) __builtin_assume_aligned (B, BlockSize<T>::align);

    vx P[MR*NR] = {};

    for (Index l=0; l<kc; ++l) {
        const vx *b = (const vx *)B;
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                P[i*NR+j] += A[i]*b[j];
            }
        }
        A += MR;
        B += vlen*NR;
    }

    if (alpha!=T(1)) {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                P[i*NR+j] *= alpha;
            }
        }
    }

    if (beta!=T(0)) {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                const T *p = (const T *) &P[i*NR+j];
                for (Index j1=0; j1<vlen; ++j1) {
                    C[i*incRowC+(j*vlen+j1)*incColC] *= beta;
                    C[i*incRowC+(j*vlen+j1)*incColC] += p[j1];
                }
            }
        }
    } else {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                const T *p = (const T *) &P[i*NR+j];
                for (Index j1=0; j1<vlen; ++j1) {
                    C[i*incRowC+(j*vlen+j1)*incColC] = p[j1];
                }
            }
        }
    }
}

//-- Macro Kernel --------------------------------------------------------------
template <typename Index, typename T, typename Beta, typename TC>
void
mgemm(Index mc, Index nc, Index kc,
      T alpha,
      const T *A, const T *B,
      Beta beta,
      TC *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<T>::MR;
    const Index NR = BlockSize<T>::NR;
    const Index mp  = (mc+MR-1) / MR;
    const Index np  = (nc+NR-1) / NR;
    const Index mr_ = mc % MR;
    const Index nr_ = nc % NR;

    T C_[MR*NR];

    #pragma omp parallel for
    for (Index j=0; j<np; ++j) {
        const Index nr = (j!=np-1 || nr_==0) ? NR : nr_;

        for (Index i=0; i<mp; ++i) {
            const Index mr = (i!=mp-1 || mr_==0) ? MR : mr_;

            if (mr==MR && nr==NR) {
                ugemm(kc, alpha,
                      &A[i*kc*MR], &B[j*kc*NR],
                      beta,
                      &C[i*MR*incRowC+j*NR*incColC],
                      incRowC, incColC);
            } else {
                ugemm(kc, alpha,
                      &A[i*kc*MR], &B[j*kc*NR],
                      T(0),
                      C_, Index(1), MR);
                gescal(mr, nr, beta,
                       &C[i*MR*incRowC+j*NR*incColC],
                       incRowC, incColC);
                geaxpy(mr, nr, T(1), C_, Index(1), MR,
                       &C[i*MR*incRowC+j*NR*incColC],
                       incRowC, incColC);
            }
        }
    }
}
//-- Packing blocks ------------------------------------------------------------
template <typename Index, typename TA, typename T>
void
pack_A(Index mc, Index kc,
       const TA *A, Index incRowA, Index incColA,
       T *p)
{
    Index MR = BlockSize<T>::MR;
    Index mp = (mc+MR-1) / MR;

    for (Index j=0; j<kc; ++j) {
        for (Index l=0; l<mp; ++l) {
            for (Index i0=0; i0<MR; ++i0) {
                Index i  = l*MR + i0;
                Index nu = l*MR*kc + j*MR + i0;
                p[nu]   = (i<mc) ? A[i*incRowA+j*incColA]
                                 : T(0);
            }
        }
    }
}

template <typename Index, typename TB, typename T>
void
pack_B(Index kc, Index nc,
       const TB *B, Index incRowB, Index incColB,
       T *p)
{
    Index NR = BlockSize<T>::NR;
    Index np = (nc+NR-1) / NR;

    for (Index l=0; l<np; ++l) {
        for (Index j0=0; j0<NR; ++j0) {
            for (Index i=0; i<kc; ++i) {
                Index j  = l*NR+j0;
                Index nu = l*NR*kc + i*NR + j0;
                p[nu]   = (j<nc) ? B[i*incRowB+j*incColB]
                                 : T(0);
            }
        }
    }
}
//-- Frame routine -------------------------------------------------------------
template <typename Index, typename Alpha,
         typename TA, typename TB,
         typename Beta,
         typename TC>
void
gemm(Index m, Index n, Index k,
     Alpha alpha,
     const TA *A, Index incRowA, Index incColA,
     const TB *B, Index incRowB, Index incColB,
     Beta beta,
     TC *C, Index incRowC, Index incColC)
{
    typedef typename std::common_type<Alpha, TA, TB>::type  T;

    const Index MC = BlockSize<T>::MC;
    const Index NC = BlockSize<T>::NC;
    const Index MR = BlockSize<T>::MR;
    const Index NR = BlockSize<T>::NR;

    const Index KC = BlockSize<T>::KC;
    const Index mb = (m+MC-1) / MC;
    const Index nb = (n+NC-1) / NC;
    const Index kb = (k+KC-1) / KC;
    const Index mc_ = m % MC;
    const Index nc_ = n % NC;
    const Index kc_ = k % KC;

    T *A_ = (T*) malloc_(BlockSize<T>::align, sizeof(T)*(MC*KC+MR));
    T *B_ = (T*) malloc_(BlockSize<T>::align, sizeof(T)*(KC*NC+NR));

    if (alpha==Alpha(0) || k==0) {
        gescal(m, n, beta, C, incRowC, incColC);
        return;
    }

    for (Index j=0; j<nb; ++j) {
        Index nc = (j!=nb-1 || nc_==0) ? NC : nc_;

        for (Index l=0; l<kb; ++l) {
            Index   kc  = (l!=kb-1 || kc_==0) ? KC : kc_;
            Beta beta_  = (l==0) ? beta : Beta(1);

            pack_B(kc, nc,
                   &B[l*KC*incRowB+j*NC*incColB],
                   incRowB, incColB,
                   B_);

            for (Index i=0; i<mb; ++i) {
                Index mc = (i!=mb-1 || mc_==0) ? MC : mc_;

                pack_A(mc, kc,
                       &A[i*MC*incRowA+l*KC*incColA],
                       incRowA, incColA,
                       A_);

                mgemm(mc, nc, kc,
                      T(alpha), A_, B_, beta_,
                      &C[i*MC*incRowC+j*NC*incColC],
                      incRowC, incColC);
            }
        }
    }
    free_(A_);
    free_(B_);
}

//------------------------------------------------------------------------------

void myprod(double *c, const double* a, const double* b, int N) {
    gemm(N, N, N, 1.0, a, 1, N, b, 1, N, 0.0, c, 1, N);
}

int main(int argc, char* argv[]) {
  int N = atoi(argv[1]);
  int tries = 4;
  int rep = std::max<int>(1,10000000/N/N/N);

  Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd c_E(N,N);

  Eigen::BenchTimer t1, t2;

  BENCH(t1, tries, rep, c_E.noalias() = a_E*b_E );
  BENCH(t2, tries, rep, myprod(c_E.data(), a_E.data(), b_E.data(), N));

  std::cout << "Time taken by Eigen is: " << t1.best() << "\n";
  std::cout << "Time taken by for-loop is: " << t2.best() << "\n\n";
}

关于c++ - 如何写出可以和Eigen抗衡的matrix矩阵乘积？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/35620853/

文章推荐： c++ - 如何使用较旧的 C++ 标准编译 Boost？ (特别是 C++03)

文章推荐： c++ - 和有什么区别？

文章推荐： c++ - `std::alignment_of` 与 `alignof`

linux - 写出 libjpeg 图像时内存泄漏
我在一个本土 C++ 框架内工作，其中一个类间接连接到 libjpeg-8c.so(作为 Ubuntu 16.04 突触包获得)。我在我的应用程序上运行 valgrind，它最终会写出图像数据，如下所
java - 写出 Tomcat 响应头
上下文我正在运行一个 Tomcat 8.5 服务器，前端有一个 Nginx 反向代理来终止 SSL 连接，启用更多压缩。在 Tomcat 服务器上，我有一个正在运行的 Web 应用程序，其中包含一
javascript - 如何用 JavaScript 写出 "array of pattern"？
我正在尝试让 grunt-jsdoc-plugin 正常工作，但遇到了一个小问题。在我的控制台中，我不断收到: Running "jsdoc:dist" (jsdoc) task Warning: C
c# - 写出 zip 文件在 IE7 中不起作用
我继承了一个在数据库中存储 zip 文件的旧应用程序，需要检索该文件。在 Firefox 中运行良好，我可以打开 zip 并且其中的每个文件都很好。当我在 IE7 中运行它时，出现以下错误。 Inte
PHP CMS 教程，写出 html 文件
我想创建一个能够写出平面 html 文件的 cms，因此不需要数据库参与。这个想法是CMS将允许编辑和更新文件(用php编写，如果需要的话可以使用mysql数据库)，然后将这些更改保存/写出到htm
php - 写出 html 而不是动态创建元素时出现 Javascript 错误
我有一个 javascript 函数，当通过 javascript 添加 td 元素时，它可以在 onclick 上正常运行。删除按钮工作正常。但是当我使用 php 创建元素并单击“删除”时，我得到:
javascript - nodejs 服务器，写出 png 到响应页面
我正在使用 node-png 库制作 png，然后将其保存在本地目录中，但是当我重新打开它时，它说它不存在。我想读入数据并将其发送出去，或者只是让响应发送一个带图片的字段。这是我到目前为止所拥有的:
javascript - 写出 base 7 word 的函数，如二进制计数器样式
我需要一个类似于此处解释的函数... JS function for writing out a word, binary counter style ...但使用基数 7(或其他)生成(计数)从 A
python - 如何强制 matplotlib 写出 x 轴标签的完整形式，避免使用科学记数法？
我使用 matplotlib.pyplot 创建了一个简单的 hexbin 图。我没有更改任何默认设置。我的 x 轴信息范围从 2003 到 2009，而 y 值范围从 15 到 35。matplot
c++ - .读入实数，直到找到 10 个正值。写出 10 个正数的总和
这是我的代码的重要部分: int realnum, positive = 0, total, poscount; for (poscount = 1; poscount > realnum;
julia - 如何将分区的 Apache Arrow 或 Parquet 文件读入/写出 Julia
我正在尝试在 Julia 中读取和写入一个简单的数据集。数据集是 mtcars ，取自 R，任意添加一列 bt带有随机 bool 值。文件/文件夹结构(如下)是使用 R arrow 写出的。包裹。文
python - 有没有人能够使用 python 的 xlwt 写出 UTF-8 字符？
我正在尝试将数据写入包含日语字符的 Excel 文件。我正在使用 codec.open() 来获取数据，这似乎工作正常，但是当我尝试写入数据时遇到了这个错误: UnicodeEncodeError:

可可西里

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c++ - 如何写出可以和Eigen抗衡的matrix矩阵乘积？