
python - cublasXt matrix multiply succeeds in C++, fails in Python

Reposted. Author: 行者123. Updated: 2023-12-01 02:25:26

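I am trying to wrap the cublasXt*gemm functions from CUDA 9.0 using ctypes in Python 2.7.14 on Ubuntu Linux 16.04. These functions accept arrays in host memory as some of their arguments. I have been able to use them successfully in C++ as follows: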

#include <iostream>
#include <cstdlib>
#include "cublasXt.h"
#include "cuda_runtime_api.h"

// Allocate an m x n matrix on the host and fill it with random values in [0, 1].
void rand_mat(float* &x, int m, int n) {
    x = new float[m*n];
    for (int i=0; i<m; ++i) {
        for (int j=0; j<n; ++j) {
            x[i*n+j] = ((float)rand())/RAND_MAX;
        }
    }
}

int main(void) {
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = {0};
    if (cublasXtDeviceSelect(handle, 1, devices) !=
        CUBLAS_STATUS_SUCCESS) {
        std::cout << "initialization failed" << std::endl;
        return 1;
    }

    float *a, *b, *c;
    int m = 4, n = 4, k = 4;

    rand_mat(a, m, k);
    rand_mat(b, k, n);
    rand_mat(c, m, n);

    float alpha = 1.0;
    float beta = 0.0;

    if (cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k, &alpha, a, m, b, k, &beta, c, m) !=
        CUBLAS_STATUS_SUCCESS) {
        std::cout << "matrix multiply failed" << std::endl;
        return 1;
    }
    // Arrays allocated with new[] must be released with delete[].
    delete[] a; delete[] b; delete[] c;
    cublasXtDestroy(handle);
}

However, when I try to wrap them in Python as follows, I get a segfault at the cublasXt*gemm call:

import ctypes
import numpy as np

_libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
_libcublas.cublasXtCreate.restype = int
_libcublas.cublasXtCreate.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDestroy.restype = int
_libcublas.cublasXtDestroy.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDeviceSelect.restype = int
_libcublas.cublasXtDeviceSelect.argtypes = [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_void_p]
_libcublas.cublasXtSgemm.restype = int
_libcublas.cublasXtSgemm.argtypes = [ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_int]

handle = ctypes.c_void_p()
_libcublas.cublasXtCreate(ctypes.byref(handle))
deviceId = np.array([0], np.int32)
status = _libcublas.cublasXtDeviceSelect(handle, 1,
                                         deviceId.ctypes.data)
if status:
    raise RuntimeError

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
c = np.zeros((4, 4), np.float32)

status = _libcublas.cublasXtSgemm(handle, 0, 0, 4, 4, 4,
                                  ctypes.byref(ctypes.c_float(1.0)),
                                  a.ctypes.data, 4, b.ctypes.data, 4,
                                  ctypes.byref(ctypes.c_float(0.0)),
                                  c.ctypes.data, 4)
if status:
    raise RuntimeError
print 'success? ', np.allclose(np.dot(a.T, b.T).T, c_gpu.get())
_libcublas.cublasXtDestroy(handle)

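Strangely, if I slightly modify the Python wrappers above to accept pycuda.gpuarray.GPUArray matrices that I have already transferred to the GPU, they work. Any ideas as to why I get a segfault only in Python when passing host memory to the functions?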

Best answer

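There appears to be an error in the CUBLAS documentation for these Xt<t>gemm functions. At least since CUDA 8, the parameters m, n, k, lda, ldb, ldc are all of type size_t. This can be discovered by looking at the header file cublasXt.h.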
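
To make the consequence of that mismatch concrete, the widths of the two types can be compared directly from Python. This is a minimal sketch that is not part of the original answer; it only inspects ctypes type sizes and does not call CUBLAS. On a typical 64-bit Linux build, int is 4 bytes and size_t is 8 bytes, so declaring these arguments as c_int tells ctypes to prepare values narrower than what the library expects to read.

```python
import ctypes

# Widths of the two candidate argument types on this platform.
int_size = ctypes.sizeof(ctypes.c_int)
size_t_size = ctypes.sizeof(ctypes.c_size_t)
pointer_size = ctypes.sizeof(ctypes.c_void_p)

print("sizeof(int)    = %d bytes" % int_size)
print("sizeof(size_t) = %d bytes" % size_t_size)
print("sizeof(void*)  = %d bytes" % pointer_size)

# On mainstream platforms size_t has the same width as a pointer,
# while int is usually narrower on 64-bit systems.
if size_t_size > int_size:
    print("c_int is narrower than size_t here; "
          "size_t parameters should be declared as c_size_t")
```

If the last message is printed, any argtypes entry that declares a size_t parameter as c_int is describing the wrong calling convention for that argument, which is consistent with the crash described in the question.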

The following modification of your wrapper seems to work correctly for me:

$ cat t1340.py
import ctypes
import numpy as np

_libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
_libcublas.cublasXtCreate.restype = int
_libcublas.cublasXtCreate.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDestroy.restype = int
_libcublas.cublasXtDestroy.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDeviceSelect.restype = int
_libcublas.cublasXtDeviceSelect.argtypes = [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_void_p]
_libcublas.cublasXtSgemm.restype = int
_libcublas.cublasXtSgemm.argtypes = [ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_size_t,
                                     ctypes.c_size_t,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t]

handle = ctypes.c_void_p()
_libcublas.cublasXtCreate(ctypes.byref(handle))
deviceId = np.array([0], np.int32)
status = _libcublas.cublasXtDeviceSelect(handle, 1,
                                         deviceId.ctypes.data)
if status:
    raise RuntimeError

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
c = np.zeros((4, 4), np.float32)
alpha = ctypes.c_float(1.0)
beta = ctypes.c_float(0.0)

status = _libcublas.cublasXtSgemm(handle, 0, 0, 4, 4, 4,
                                  ctypes.byref(alpha),
                                  a.ctypes.data, 4, b.ctypes.data, 4,
                                  ctypes.byref(beta),
                                  c.ctypes.data, 4)
if status:
    raise RuntimeError
print 'success? ', np.allclose(np.dot(a.T, b.T).T, c)
_libcublas.cublasXtDestroy(handle)
$ python t1340.py
success?  True
$

Enumerating the changes I made:

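  1. Changed the argtypes for the m, n, k, lda, ldb, ldc parameters of cublasXtSgemm from c_int to c_size_t
  2. Provided explicit variables for your alpha and beta parameters; this is probably irrelevant
  3. In your np.allclose call, changed c_gpu.get() to just c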

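The above was tested on CUDA 8 and CUDA 9. I have filed an internal bug with NVIDIA to have the documentation updated (even the current CUDA 9 documentation does not reflect the current state of the header files).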

Regarding "python - cublasXt matrix multiply succeeds in C++, fails in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47466589/
