python - How to generalize fast matrix multiplication on the GPU using numba

Reposted by: 行者123. Last updated: 2023-12-05 08:25:33

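Lately I have been experimenting with GPU programming in Python using the Numba library. I have been reading the tutorial on their website, and I am currently stuck on their examples, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I am trying to generalize the fast matrix multiplication example (which has the form A*B=C). While testing, I noticed that matrices whose dimensions are not evenly divisible by the number of threads per block (TPB) do not produce the correct answer.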

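I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 x 4 matrices. With TPB=2 everything works, but with TPB=3 the result is wrong. I understand that the code reads outside the bounds of the matrices, but I have not been able to stop that from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but none of them worked).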

from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)

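I would like to write code that does not depend on the dimensions of the matrices A, B, and C being evenly divisible by TPB, because these are sometimes outside my control. I understand that GPUs are only faster for matrix multiplication with very large matrices, but I want to use small examples to check that the answers are correct before applying the code to real data.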

Best answer

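There are arguably at least two errors in the posted code:

1. This cannot be a correct range check:
```
if x >= C.shape[0] and y >= C.shape[1]:
```
   For us to decide that a particular thread in the grid should not do any loading, we require that either x is out of range or y is out of range. The and should be an or.
2. It is illegal to use cuda.syncthreads() in conditional code if not all threads in the block can participate in that statement. The return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes that are not whole-number divisible by the thread block size.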

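Therefore, to fix these issues, we cannot use a simple return statement for out-of-bounds threads. Instead, at the point of load, a thread must copy from global memory to shared memory only if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds by definition). Furthermore, when writing a result, we must write only computed results that are in-bounds for C.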
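The guarded-load idea can be checked on the CPU before touching a GPU. The following is a minimal NumPy sketch, not the answer's kernel: `load_tile` and `tiled_matmul_emulated` are hypothetical helper names, each block is emulated as one TPB x TPB tile, and the tile loop runs over `ceil(K / tpb)` instead of `cuda.gridDim.x`. Out-of-range tile cells stay zero, so they add nothing to the dot product, and only in-bounds results are written back.

```python
import math
import numpy as np

def load_tile(M, row0, col0, tpb):
    """Copy a tpb x tpb tile of M starting at (row0, col0); cells outside M stay 0."""
    tile = np.zeros((tpb, tpb), dtype=np.float32)
    rows = max(0, min(tpb, M.shape[0] - row0))
    cols = max(0, min(tpb, M.shape[1] - col0))
    tile[:rows, :cols] = M[row0:row0 + rows, col0:col0 + cols]
    return tile

def tiled_matmul_emulated(A, B, tpb):
    """CPU emulation of a zero-padded, tiled A @ B (hypothetical helper)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float32)
    for by in range(math.ceil(m / tpb)):          # block row of C
        for bx in range(math.ceil(n / tpb)):      # block column of C
            acc = np.zeros((tpb, tpb), dtype=np.float32)
            for i in range(math.ceil(k / tpb)):   # walk the shared dimension tile by tile
                sA = load_tile(A, by * tpb, i * tpb, tpb)
                sB = load_tile(B, i * tpb, bx * tpb, tpb)
                acc += sA @ sB                    # padded zeros contribute nothing
            # Write back only the part of the tile that lies inside C
            rows = min(tpb, m - by * tpb)
            cols = min(tpb, n - bx * tpb)
            C[by * tpb:by * tpb + rows, bx * tpb:bx * tpb + cols] = acc[:rows, :cols]
    return C

A = np.arange(16).reshape(4, 4)
B = np.ones((4, 4))
result = tiled_matmul_emulated(A, B, tpb=3)   # 4 is not divisible by 3
print(result)
```

Comparing `result` against `A @ B` for several tile sizes is a cheap way to confirm the indexing logic before debugging the same logic inside a kernel.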

The following code has these items fixed. It seems to work correctly for your given test case:

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
            sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$

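(Note that the use of and in the in-bounds tests is correct. Testing whether a set of indices is in-bounds is different, in a boolean sense, from testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both indices to be in-bounds. In the out-of-bounds test, either index being out-of-bounds is disqualifying.)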
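The boolean point can be checked in plain Python. This is a small sketch with hypothetical helper names (`in_bounds`, `out_of_bounds`, `buggy_out_of_bounds`), using the 6 x 6 thread grid that TPB=3 produces for a 4 x 4 matrix. It lists the threads that the original `and` check fails to reject.

```python
def in_bounds(x, y, shape):
    # Both indices must be valid
    return x < shape[0] and y < shape[1]

def out_of_bounds(x, y, shape):
    # Either index being invalid is enough (De Morgan's law applied to in_bounds)
    return x >= shape[0] or y >= shape[1]

def buggy_out_of_bounds(x, y, shape):
    # The check from the question: only true when BOTH indices are invalid
    return x >= shape[0] and y >= shape[1]

shape = (4, 4)
# 2 x 2 blocks of 3 x 3 threads cover a 6 x 6 grid of (x, y) pairs
threads = [(x, y) for x in range(6) for y in range(6)]

# The corrected pair of tests are exact negations of each other
assert all(in_bounds(x, y, shape) != out_of_bounds(x, y, shape) for x, y in threads)

# Threads that are out of range but slip past the buggy check
missed = [(x, y) for x, y in threads
          if out_of_bounds(x, y, shape) and not buggy_out_of_bounds(x, y, shape)]
print(len(missed), missed[:4])
```

Every thread in `missed` has exactly one index out of range, which is the case the `and` version lets through.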

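I am not claiming that the code above is defect-free or suitable for any particular purpose. It is offered to demonstrate a possible fix for the issues I identified. As you have discovered, getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, and I have not tested it beyond what is shown here. (For example, if you decide to make TPB larger than 32, you will run into other problems, since a TPB x TPB block would then exceed the CUDA limit of 1024 threads per block. Also, the originally posted code is advertised only for square matrix multiplication, and it will not work in the general non-square case.)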

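As noted above, neither the posted code nor the code above with the "fixes" will correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle non-square cases. In a nutshell, we must size the grid large enough to cover the dimensions of both input matrices, while still writing results only for in-bounds elements of the output matrix. Here is a lightly tested example: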

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp



#%%

x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
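The grid sizing used in the listing above can be isolated and inspected without a GPU. This is a sketch under the same rule as the listing (take the maximum over both input shapes, then round up); `launch_config` is a hypothetical helper name. It also checks that the grid's x extent, which the kernel reuses as its tile-loop count `bpg`, is at least the number of tiles needed along the shared dimension.

```python
import math

def launch_config(a_shape, b_shape, tpb):
    """Return (blockspergrid, threadsperblock) sized to cover both inputs."""
    grid_y_max = max(a_shape[0], b_shape[0])   # rows of A and rows of B
    grid_x_max = max(a_shape[1], b_shape[1])   # columns of A and columns of B
    blocks_x = math.ceil(grid_x_max / tpb)
    blocks_y = math.ceil(grid_y_max / tpb)
    return (blocks_x, blocks_y), (tpb, tpb)

a_shape, b_shape = (5, 23), (23, 7)

for tpb in (32, 8, 3):
    blockspergrid, threadsperblock = launch_config(a_shape, b_shape, tpb)
    tiles_needed = math.ceil(a_shape[1] / tpb)   # tiles along the shared dimension K
    # The kernel loops `for i in range(bpg)` with bpg = gridDim.x
    assert blockspergrid[0] >= tiles_needed
    print(tpb, blockspergrid, threadsperblock, tiles_needed)
```

Sizing the grid this way launches more threads than the output strictly needs, which is why the final write in the kernel is still guarded by the shape of C.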

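I have also reordered the sense of x and y (and the usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the originally posted doc code as well.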
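The answer does not spell out the performance issue. One plausible reading, stated here as an assumption, is memory coalescing: threads that are neighbours along threadIdx.x should touch neighbouring addresses. The sketch below uses a hypothetical `flat_offset` helper to compute the row-major offsets that four neighbouring threads would write to under each index order.

```python
import numpy as np

C = np.zeros((5, 7), dtype=np.float32)   # row-major (C-contiguous) by default
n_cols = C.shape[1]

def flat_offset(row, col):
    """Element offset of C[row, col] in row-major storage."""
    return row * n_cols + col

tx_values = np.arange(4)   # four neighbouring threads along threadIdx.x
other = 2                  # the other index, identical for all four threads

# Original order: x (driven by threadIdx.x) selects the ROW, as in C[x, y]
original_offsets = flat_offset(tx_values, other)
# Reordered: x (driven by threadIdx.x) selects the COLUMN, as in C[y, x]
reordered_offsets = flat_offset(other, tx_values)

print(np.diff(original_offsets))    # distance between neighbouring threads' addresses
print(np.diff(reordered_offsets))
```

Under this reading, the reordered kernel has neighbouring threads access consecutive elements, while the original order spreads them a full row apart.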

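Again, no claims of being defect-free. Furthermore, I am sure that more optimal code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.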
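As a rough illustration of the library route, the sketch below multiplies the same 5 x 23 and 23 x 7 test matrices with CuPy's `@` operator. The fallback to NumPy when CuPy or a CUDA device is unavailable is an addition of this sketch (not something the answer proposes), included only so the snippet runs anywhere; `xp` is a hypothetical alias for whichever array module is in use.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:   # also raises if the CUDA runtime is unusable
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    xp = np   # CPU fallback (assumption of this sketch) so the example still runs

x_h = np.arange(115, dtype=np.float32).reshape(5, 23)
y_h = np.ones((23, 7), dtype=np.float32)

# Move the inputs to the active backend and multiply with the library routine
x_d = xp.asarray(x_h)
y_d = xp.asarray(y_h)
z_d = x_d @ y_d

# Bring the result back to host memory when it lives on the GPU
z_h = cp.asnumpy(z_d) if xp is not np else z_d
print(z_h)
```

No tile size, grid size, or bounds check appears in user code here, which is the main appeal of the library approach for arbitrary shapes.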

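EDIT: As discussed elsewhere, the OP's code (and, it seems, the doc example) also has a performance issue in how the tmp variable is set up. Changing it to a proper 32-bit float variable makes an important performance difference.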
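The effect of the accumulator's type can be illustrated with NumPy scalars as a stand-in. This is an analogy, not Numba itself: it assumes that Numba types the literal `0.` as a 64-bit float, so that adding float32 products to it promotes the running sum to float64, whereas `float32(0.)` keeps the accumulation in 32 bits.

```python
import numpy as np

a = np.float32(1.5)
b = np.float32(2.0)

# Stand-in for `tmp = 0.`: a 64-bit accumulator
tmp64 = np.float64(0.0)
tmp64 = tmp64 + a * b      # float64 + float32 promotes to float64

# Stand-in for `tmp = float32(0.)`: a 32-bit accumulator
tmp32 = np.float32(0.0)
tmp32 = tmp32 + a * b      # stays float32

print(tmp64.dtype, tmp32.dtype)
```

On many GPUs 64-bit floating-point arithmetic is much slower than 32-bit arithmetic, which is consistent with the performance difference the answer reports.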

Regarding "python - How to generalize fast matrix multiplication on the GPU using numba", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64197780/
