cuda - Strange cuBLAS gemm batched performance

Reposted by: 行者123. Updated: 2023-12-02 07:32:07

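I am seeing some strange performance from cublasSgemmStridedBatched and am looking for an explanation. The matrix size is fixed at 20x20. Here are timings for a few different batch sizes (multiplication only, no data transfer):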

batch = 100, time = 0.2 ms
batch = 1,000, time = 1.9 ms
batch = 10,000, time = 18.3 ms
batch = 100,000, time = 5.3 ms
batch = 1,000,000, time = 52.8 ms

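The first few batch sizes behave as I would expect: each tenfold increase in batch size increases the time roughly tenfold. With 100,000 matrices, however, the total time suddenly drops by a factor of 3.4 (18.3 ms down to 5.3 ms), even though there is ten times more work.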

If the matrix size is fixed at 10x10 and the trial is run again, I find:

batch = 100, time = 0.2 ms
batch = 1,000, time = 2.0 ms
batch = 10,000, time = 20.0 ms
batch = 100,000, time = 0.9 ms
batch = 1,000,000, time = 8.9 ms

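Again there is a surprising speedup at a batch size of 100,000, this time 22x (20.0 ms down to 0.9 ms). Why are batch sizes of 1,000 and 10,000 slower than a batch size of 100,000 when the matrix size is still 10x10?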
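To make the size of the jump concrete, the timings above can be converted into an average cost per matrix. The following is a small sketch of that arithmetic only; the helper names `us_per_matrix` and `per_matrix_speedup` are made up for illustration and are not part of cuBLAS.

```c
#include <assert.h>

/* Average time per matrix in microseconds, given the total time in
   milliseconds for a batch of `batch` matrices. */
double us_per_matrix(double total_ms, long batch)
{
    assert(batch > 0);
    return total_ms * 1000.0 / (double)batch;
}

/* Ratio of the per-matrix cost of measurement A to that of measurement B,
   i.e. how many times cheaper each matrix is in B than in A. */
double per_matrix_speedup(double ms_a, long batch_a,
                          double ms_b, long batch_b)
{
    return us_per_matrix(ms_a, batch_a) / us_per_matrix(ms_b, batch_b);
}
```

With the 20x20 numbers, `per_matrix_speedup(18.3, 10000, 5.3, 100000)` is roughly 34.5: the total time falls 3.4x while the work grows 10x. With the 10x10 numbers, `per_matrix_speedup(20.0, 10000, 0.9, 100000)` is roughly 222. A change of that size in the cost per matrix is hard to explain unless a different code path is being used.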

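Are different algorithms used for different batch sizes? I find this behaviour quite strange. Running the same trial with cublasSgemmBatched gives similar results. These trials were run on a GeForce GTX 1080 Ti. Minimal working code follows: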

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
// Build: nvcc -lcublas cublas.c -o cublas.out

int main(int argc, char* argv[])
{
    int i, j, k, index;

    // Linear dimension of the square matrices
    int dim = 20;
    int batch_count = 10*10*10*10*10*1;
    // Total size in bytes of one batch of matrices
    size_t bytes = sizeof(float) * dim * dim * batch_count;

    // Allocate host storage for batch_count A, B, C square matrices
    float* h_A = (float*)malloc(bytes);
    float* h_B = (float*)malloc(bytes);
    float* h_C = (float*)malloc(bytes);
    if (!h_A || !h_B || !h_C) {
        fprintf(stderr, "Host allocation failed\n");
        return 1;
    }
    for (k = 0; k < batch_count; k++) {
        for (j = 0; j < dim; j++) {
            for (i = 0; i < dim; i++) {
                index = i*dim + j + k*dim*dim;
                // Square in float: index*index overflows int for large batches
                h_A[index] = (float)index * (float)index;
                h_B[index] = index + 1.0f;
                h_C[index] = 0.0f;
            }
        }
    }

    // Allocate device storage and copy the inputs from host to device.
    // (The original call had the arguments and direction reversed.)
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Time only the multiplication, using CUDA events
    float alpha = 1.0f;
    float beta = 1.0f;
    float time_cuda_event;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N,
                              CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              (const float*)d_A, dim,
                              dim*dim,
                              (const float*)d_B, dim,
                              dim*dim,
                              &beta,
                              d_C, dim,
                              dim*dim,
                              batch_count);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time_cuda_event, start, stop);
    printf("Time :  %3.1f ms \n", time_cuda_event);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Release the events, the handle, and all memory
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

Best answer

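This appears to be simply the result of heuristics inside CUBLAS. If I run a modified (and working) version of your code, I get these timings for the 5x5 case: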

Batch size :           10   Time :  0.019104 ms 
Batch size :          100   Time :  0.038304 ms 
Batch size :         1000   Time :  0.163520 ms 
Batch size :        10000   Time :  1.410944 ms 
Batch size :       100000   Time :  1.614144 ms 
Batch size :      1000000   Time :  16.057407 ms

Profiling shows that for batches of up to 10000 entries, the library runs a single kernel:

1.10759s  16.831us             (1 1 10)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [3939]
1.10766s  19.168us            (1 1 100)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [3971]
1.10773s  147.71us           (1 1 1000)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [4003]
1.10791s  1.4064ms          (1 1 10000)       (128 1 1)       120  12.250KB        0B         -           -           -           -  GeForce GTX 970         1         7  maxwell_sgemm_128x64_nn [4035]

whereas at larger batch sizes it runs multiple launches of a different kernel to service the call:

1.10935s  1.1518ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4063]
1.11050s  606.54us          (1 1 34465)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4087]
1.11113s  1.1498ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4115]
1.11228s  1.1501ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4139]
1.11344s  1.1511ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4163]
1.11459s  1.1494ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4187]
1.11574s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4211]
1.11689s  1.1503ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4235]
1.11804s  1.1499ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4259]
1.11919s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4283]
1.12035s  1.1507ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4307]
1.12150s  1.1509ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4331]
1.12265s  1.1489ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4355]
1.12380s  1.1496ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4379]
1.12495s  1.1500ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4403]
1.12610s  1.1494ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4427]
1.12726s  1.1503ms          (1 1 65535)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4451]
1.12841s  299.35us          (1 1 16975)       (16 16 1)        31  2.1250KB        0B         -           -           -           -  GeForce GTX 970         1         7  void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4475]
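The grid sizes in the profile above are consistent with the batch being cut into launches of at most 65535 items, which is the maximum value of gridDim.z on CUDA devices. The sketch below reproduces that split; `split_batch` is a hypothetical helper written for illustration, and the idea that the library splits the batch for this reason is an inference from the profile, not documented cuBLAS behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum gridDim.z (and gridDim.y) on CUDA devices. */
#define MAX_GRID_Z 65535

/* Split `batch` items into launches of at most MAX_GRID_Z items each.
   Writes the per-launch sizes into `out` (capacity `cap`) and returns the
   number of launches. Returns 0 if `out` is too small or batch is 0. */
size_t split_batch(long batch, int *out, size_t cap)
{
    size_t n = 0;
    assert(batch >= 0);
    while (batch > 0) {
        int chunk = batch > MAX_GRID_Z ? MAX_GRID_Z : (int)batch;
        if (n == cap)
            return 0;          /* not enough room for another launch */
        out[n++] = chunk;
        batch -= chunk;
    }
    return n;
}
```

For a batch of 100,000 this gives two launches of 65535 and 34465 items, and for 1,000,000 it gives fifteen launches of 65535 followed by one of 16975, matching the grid z extents listed in the profile.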

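The inconsistencies you have observed seem to be caused by the library switching from one kernel to another, probably based on some batch size criterion. You can see that both kernels appear to use one block per batch item: the kernel used at larger batch sizes uses a 2D block of 256 threads (16x16), whereas the kernel used at smaller batch sizes uses a 1D block of 128 threads. Beyond that, the performance differences come down to internal implementation details. Although it probably violates the End User License to do so, if you want to understand more you will need to disassemble the kernels and look at how they work. The toolkit contains all the tools required to do this, although I am not suggesting that you do so.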
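The profile durations can also be turned into a cost per batch item for each kernel. The sketch below does that arithmetic and adds a rough tile-utilisation estimate. Both helper names are hypothetical, and the reading of "128x64" in `maxwell_sgemm_128x64_nn` as the output tile computed by one block is an assumption based on the kernel name, not something the profile confirms.

```c
#include <assert.h>

/* Kernel time per batch item in nanoseconds. `grid_z` is the z extent of
   the launch grid, i.e. the number of batch items served by that launch. */
double ns_per_item(double kernel_us, long grid_z)
{
    assert(grid_z > 0);
    return kernel_us * 1000.0 / (double)grid_z;
}

/* Fraction of a tile_m x tile_n output tile filled by an m x n result.
   ASSUMPTION: "128x64" in the kernel name is the per-block output tile. */
double tile_fill(int m, int n, int tile_m, int tile_n)
{
    assert(tile_m > 0 && tile_n > 0);
    return ((double)m * n) / ((double)tile_m * tile_n);
}
```

Using the figures from the profile, `ns_per_item(1406.4, 10000)` is about 140.6 ns for the `maxwell_sgemm_128x64_nn` launch and `ns_per_item(1151.8, 65535)` is about 17.6 ns for a `batch_gemm_kernel1x1_core` launch, roughly an 8x difference per item. Under the tile assumption, `tile_fill(5, 5, 128, 64)` is about 0.3%, which would suggest that a kernel built for large tiles wastes most of its work on matrices this small.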

Source for "cuda - Strange cuBLAS gemm batched performance": a similar question on Stack Overflow: https://stackoverflow.com/questions/48519861/
