c++ - Computing a matrix determinant with the cuBLAS device API

Reposted by: 行者123. Last updated: 2023-11-28 05:24:16

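I am trying to evaluate a scalar function f(x), where x is a k-dimensional vector (i.e. f: R^k -> R). During the evaluation I have to perform many matrix operations: inversion, multiplication, and computing the determinant and trace of moderately sized matrices (most are smaller than 30x30). Now I want to evaluate the function at many different values of x simultaneously, using a different GPU thread for each one. That is why I need the device API.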

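I wrote the following code to test computing a matrix determinant through the cuBLAS device API call cublasSgetrfBatched: I first compute the LU decomposition of the matrix and then take the product of all diagonal elements of the U factor. I did this both in a GPU thread and on the CPU, in each case using the factorization returned by cuBLAS. The GPU result makes no sense, while the CPU result is correct. I ran cuda-memcheck and it reported no errors. Can anyone help shed light on this problem? Many thanks.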

    cat test2.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>


__host__ __device__ unsigned int IDX(unsigned int i,unsigned  int j,unsigned int ld){return j*ld+i;}

#define PERR(call) \
  if (call) {\
   fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
      cudaGetErrorString(cudaGetLastError()));\
   exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
       cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }

__device__ float
det_kernel(float *a_copy,unsigned int *n,cublasHandle_t *hdl){
  int *info = (int *)malloc(sizeof(int));info[0]=0;
  int batch=1;int *p = (int *)malloc(*n*sizeof(int));  
  float **a = (float **)malloc(sizeof(float *));
  *a = a_copy;  
  cublasStatus_t status=cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch);  
  unsigned int i1;
  float res=1;
  for(i1=0;i1<(*n);++i1)res*=a_copy[IDX(i1,i1,*n)];
  return res;
}

__global__ void runtest(float *a_i,unsigned int n){
  cublasHandle_t hdl;cublasCreate_v2(&hdl);
  printf("det on GPU:%f\n",det_kernel(a_i,&n,&hdl));  
  cublasDestroy_v2(hdl);
}

int
main(int argc, char **argv)
{
  float a[] = {
    1,   2,   3,
    0,   4,   5,
    1,   0,   0};
  cudaSetDevice(1);//GTX780Ti on my machine,0 for GTX1080
  unsigned int n=3,nn=n*n;
  printf("a is \n");
  for (int i = 0; i < n; ++i){    
    for (int j = 0; j < n; j++) printf("%f, ",a[IDX(i,j,n)]);    
    printf("\n");}
  float *a_d;
  PERR(cudaMalloc((void **)&a_d, nn*sizeof(float)));
  PERR(cudaMemcpy(a_d, a, nn*sizeof(float), cudaMemcpyHostToDevice));
  runtest<<<1, 1>>>(a_d,n);
  cudaDeviceSynchronize();
  ERRCHECK;

  PERR(cudaMemcpy(a, a_d, nn*sizeof(float), cudaMemcpyDeviceToHost));
  float res=1;
  for (int i = 0; i < n; ++i)res*=a[IDX(i,i,n)];
  printf("det on CPU:%f\n",res);
}

  nvcc -arch=sm_35 -rdc=true -o test test2.cu -lcublas_device -lcudadevrt
./test
a is 
1.000000, 0.000000, 1.000000, 
2.000000, 4.000000, 0.000000, 
3.000000, 5.000000, 0.000000, 
det on GPU:0.000000
det on CPU:-2.000000
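
A note on the method itself, separate from the bug being asked about: the product of U's diagonal equals the determinant only up to the sign of the row permutation chosen by partial pivoting. For the matrix above two row swaps occur, so the sign happens to be +1 and -2 is correct. The following is a minimal host-side C++ sketch of the sign-aware computation. The names lu_factor, det_from_lu and determinant are hypothetical helpers, lu_factor is only a stand-in for what a getrf routine produces, and the 1-based pivot convention is assumed to follow LAPACK style (check the cuBLAS documentation before relying on it for cublasSgetrfBatched output).

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Column-major index, same convention as IDX in the question.
static inline int idx(int i, int j, int ld) { return j * ld + i; }

// In-place LU factorization with partial pivoting (host stand-in for getrf).
// piv[k] is the 1-based index of the row swapped with row k at step k
// (LAPACK-style convention, assumed here).
// Returns false if a pivot column is entirely zero, i.e. the matrix is singular.
bool lu_factor(std::vector<float>& a, std::vector<int>& piv, int n) {
  piv.assign(n, 0);
  for (int k = 0; k < n; ++k) {
    int p = k;  // row holding the largest |a(i,k)| for i >= k
    for (int i = k + 1; i < n; ++i)
      if (std::fabs(a[idx(i, k, n)]) > std::fabs(a[idx(p, k, n)])) p = i;
    piv[k] = p + 1;
    if (a[idx(p, k, n)] == 0.0f) return false;
    if (p != k)
      for (int j = 0; j < n; ++j) std::swap(a[idx(k, j, n)], a[idx(p, j, n)]);
    for (int i = k + 1; i < n; ++i) {
      a[idx(i, k, n)] /= a[idx(k, k, n)];  // multiplier, stored in L's slot
      for (int j = k + 1; j < n; ++j)
        a[idx(i, j, n)] -= a[idx(i, k, n)] * a[idx(k, j, n)];
    }
  }
  return true;
}

// Determinant from packed LU factors: product of U's diagonal,
// negated once for every step that actually interchanged two rows.
float det_from_lu(const std::vector<float>& lu, const std::vector<int>& piv, int n) {
  float det = 1.0f;
  for (int k = 0; k < n; ++k) {
    det *= lu[idx(k, k, n)];
    if (piv[k] != k + 1) det = -det;
  }
  return det;
}

// Convenience wrapper: takes the matrix by value so the caller's copy is untouched.
float determinant(std::vector<float> a, int n) {
  std::vector<int> piv;
  if (!lu_factor(a, piv, n)) return 0.0f;
  return det_from_lu(a, piv, n);
}
```

Calling determinant on the question's array {1, 2, 3, 0, 4, 5, 1, 0, 0} with n = 3 gives -2, matching the CPU output above. For the 2x2 exchange matrix {0, 1, 1, 0} a single swap occurs, so the sign correction is what turns the diagonal product 1 into the correct determinant -1.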

Best answer

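cuBLAS device calls are asynchronous.

This means they return control to the calling thread before the cuBLAS call has completed.

If you want the calling thread to work directly on the results (as you do here when computing res), you must force a synchronization and wait for the results before starting that computation.

You do not see this problem in the host-side computation, because there is an implicit synchronization of all device activity (including the child work launched by cuBLAS through dynamic parallelism) before the parent kernel terminates.

So if you add a synchronization after the device cuBLAS call, like this: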

cublasStatus_t status=cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch); 
cudaDeviceSynchronize(); // add this line

I think you will then see the device computation match the host computation, as you expect.

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40864734/
