python - Multithreaded integer matrix multiplication in NumPy/SciPy

Reposted by: IT老高. Updated: 2023-10-28 20:35:12

Doing something such as

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

uses multiple cores, and it runs nicely.

The elements of a are 64-bit floats (or 32-bit on 32-bit platforms?), and I'd like to multiply 8-bit integer arrays instead. Trying the following, though:

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running about 1000 times slower on my PC.

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested
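
Separately from speed, it may be worth checking what the int8 path actually returns: np.dot keeps the input dtype, so sums outside the int8 range of -128 to 127 wrap around. The following is a minimal sketch of that check. The all-ones matrix and the size 200 are illustrative assumptions, not part of the benchmark above.

```python
import numpy as np

n = 200  # assumption: any size above 127 shows the effect for an all-ones matrix
ones = np.ones((n, n), dtype=np.int8)

# np.dot on int8 inputs returns int8, so each entry (a sum of n ones) wraps.
prod_int8 = np.dot(ones, ones)

# Widening the inputs first gives the true sums, at 8x the memory of int8.
prod_wide = np.dot(ones.astype(np.int64), ones.astype(np.int64))

print(prod_int8.dtype, prod_int8[0, 0])  # int8 dtype, wrapped value
print(prod_wide.dtype, prod_wide[0, 0])  # int64 dtype, true value 200
```

For 0/1 matrices this means the int8 result is only trustworthy when the inner dimension is at most 127, unless only the value modulo 256 is needed.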

I'm aware that NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, i.e.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

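the computation is multithreaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32 arrays, but for non-float inputs it converts everything to float32 and outputs floats, which is something np.dot doesn't do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue.)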

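So, if I want to do integer matrix multiplication, I have to do one of the following:

1. Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
2. Use SciPy's sgemm and use up 4x the memory.
3. Use NumPy's np.float16 and use only 2x the memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, even more so than on int8.
4. Find an optimized library for multithreaded integer matrix multiplication (actually, Mathematica does this, but I'd prefer a Python solution), ideally supporting 1-bit arrays, although 8-bit arrays are also fine... (I'm actually aiming to do multiplication of matrices over the finite field Z/2Z, and I know I can do this with Sage, which is quite Pythonic, but, again, is there something strictly Python?)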
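
For the Z/2Z goal specifically, one possible route (a sketch, not an optimized library) is to run the product through float32 np.dot and reduce the result modulo 2. The helper name gf2_matmul and the matrix size are hypothetical. The result is exact only while every entry of the integer product stays at or below 2**24, the largest range in which float32 represents all integers exactly; for 0/1 matrices that means an inner dimension of at most 2**24. np.packbits is used here only to show 1-bit storage between multiplications.

```python
import numpy as np

def gf2_matmul(a, b):
    """Product of 0/1 matrices over Z/2Z, computed via float32 np.dot.

    Hypothetical helper: exact only while the inner dimension is <= 2**24.
    """
    # The float32 copies cost 4 bytes per entry for the duration of the product.
    prod = np.dot(a.astype(np.float32), b.astype(np.float32))
    # Entries are exact integers here, so the parity bit is the Z/2Z result.
    return (prod.astype(np.int64) & 1).astype(np.uint8)

rng = np.random.default_rng(0)
n = 256  # assumption: small illustrative size, a multiple of 8
a = rng.integers(0, 2, size=(n, n), dtype=np.uint8)

c = gf2_matmul(a, a)

# 1-bit storage: pack 8 entries per byte along each row, unpack before reuse.
packed = np.packbits(c, axis=1)
restored = np.unpackbits(packed, axis=1)[:, :n]
```

The packed form uses n * n / 8 bytes, but it has to be unpacked before each multiplication, so it only saves memory at rest.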

Can I follow option 4? Does such a library exist?

Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilla NumPy, with similar results.

Best answer

Note that as this answer ages, NumPy may gain optimized integer support. Please verify that this answer is still faster on your setup.

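Option 5 - Roll a custom solution: partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.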

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    #np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken

Regarding python - Multithreaded integer matrix multiplication in NumPy/SciPy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35101312/
