
python - Numbapro: No speed-up for Matrix Multiplication


For the past few days I have been trying to understand why Numbapro (the accelerator from Continuum Analytics, Inc.; I am running the 30-day trial) gives no speed-up on my MacBook Pro (Intel Core i7, 2.6 GHz, 16 GB RAM, NVIDIA GeForce GT 650M with 1 GB on the PCI bus).

I took the example code for an (NxM)x(MxN) matrix multiplication with which Continuum Analytics, Inc. demonstrates accelerating the computation via CUDA, and compared the timings of the CUDA JIT kernel against numpy. My idea was to run, say, 1e4 iterations, with matrix B randomized on every iteration. Below is the code I used, followed by the timings I obtained. Is there any solution for this? Thanks!

from numbapro import *
from numba import *
import numpy as np
import math
from timeit import default_timer as timer

m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n])

iterations = 10000

start = timer()
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    X = np.dot(A, B)
numpy_time=(timer() - start)

@cuda.jit(void(float32[:,:],float32[:,:],float32[:,:]))
def cu_square_matrix_mul(A, B, C):

    # Each thread computes one element of C, indexed by its thread/block position.
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    x = tx + bx * bw
    y = ty + by * bh
    n = C.shape[0]

    if x >= n or y >= n:
        return

    # Naive dot product of row y of A with column x of B.
    cs = 0
    for i in range(n):
        cs += A[y, i] * B[i, x]
    C[y, x] = cs

    cuda.syncthreads()

blockdim = 256,3
griddim = 10,3

stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)

start = timer()
for i in range(iterations):
    B = np.array(np.random.random((m,n)), dtype=np.float32)
    dB = cuda.to_device(B, stream)
    cu_square_matrix_mul[griddim, blockdim, stream](dA, dB, dC)
    dC.to_host()
    stream.synchronize()
cuda_time = (timer() - start)

print()
print("Numpy took %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))

Results:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 30 days
Vendor: Continuum Analytics, Inc.
Package: numbapro
Message: trial mode expires in 30 days

Numpy took 378.328881 seconds
CUDA JIT took 342.723757 seconds, 1.10389x speedup

Best Answer

This is a completely naive matrix multiplication routine on the GPU, whereas the numpy routine is really a library call:

X = np.dot(A, B)

which is probably highly optimized (the trial-mode messages above show that numpy is backed by MKL here). I am actually impressed that the GPU version came out faster at all.
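To check which BLAS implementation a given numpy build dispatches np.dot to, numpy can print its build configuration. This is plain numpy, nothing Numbapro-specific:

import numpy as np

# Lists the BLAS/LAPACK libraries this numpy build was linked against.
# With the Anaconda/Accelerate stack from the question it should report MKL.
np.show_config()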

The "solution" is to make a call to CUBLAS for the matrix multiplication, rather than writing your own kernel.
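For reference, here is a minimal sketch of what such a call could look like with the cuBLAS binding bundled with Numbapro (numbapro.cudalib.cublas). The exact wrapper API is an assumption based on the Numbapro documentation of that era, and cuBLAS expects column-major data, so the arrays are created in Fortran order:

import numpy as np
from numbapro.cudalib import cublas  # assumed Numbapro cuBLAS wrapper

n, m = 1000, 1000
# cuBLAS works on column-major (Fortran-ordered) arrays.
A = np.asfortranarray(np.random.random((n, m)), dtype=np.float32)
B = np.asfortranarray(np.random.random((m, n)), dtype=np.float32)
C = np.zeros((n, n), dtype=np.float32, order='F')

blas = cublas.Blas()
# C = 1.0 * (A @ B) + 0.0 * C, computed by cuBLAS SGEMM on the GPU.
blas.gemm('N', 'N', n, n, m, 1.0, A, B, 0.0, C)

This keeps the whole multiplication inside a tuned library kernel, which is exactly what the numpy path already does on the CPU via MKL.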

Regarding python - Numbapro: No speed-up for Matrix Multiplication, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26568194/
