python - How to determine if numba's prange actually works correctly?

Reposted by: 太空狗 Updated: 2023-10-30 00:08:51

In another Q+A (Can I perform dynamic cumsum of rows in pandas?) I commented on the correctness of using prange in this code (from this answer):

from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i] 
    cumsum.append([index[-1], running])

    return cumsum

The comment was:

I wouldn't recommend parallelizing a loop that isn't pure. In this case the running variable makes it impure. There are 4 possible outcomes: (1) numba decides that it cannot parallelize it and just process the loop as if it was cumsum instead of prange (2) it can lift the variable outside the loop and use parallelization on the remainder (3) numba incorrectly inserts synchronization between the parallel executions and the result may be bogus (4) numba inserts the necessary synchronizations around running which may impose more overhead than you gain by parallelizing it in the first place

And a later addition:

Of course both the running and cumsum variable make the loop "impure", not just the running variable as stated in the previous comment

Then I was asked:

This might sound like a silly question, but how can I figure out which of the 4 things it did and improve it? I would really like to become better with numba!

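Given that this may be useful to future readers, I decided to create a self-answered Q+A here. Spoiler: I cannot really answer which of the 4 outcomes is produced (or whether numba produces a totally different one), so I strongly encourage other answers.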

Best answer

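TL;DR: First of all, prange is identical to range unless you add parallel to jit, for example njit(parallel=True). If you try that, you'll see an exception about an "unsupported reduction". That is because Numba limits the scope of prange to "pure" loops and to "impure" loops with numba-supported reductions, and makes the user responsible for ensuring that the loop falls into one of these categories.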

This is clearly stated in the documentation of numba's prange (version 0.42):

1.10.2. Explicit Parallel Loops

Another feature of this code transformation pass is support for explicit parallel loops. One can use Numba’s prange instead of range to specify that a loop can be parallelized. The user is required to make sure that the loop does not have cross iteration dependencies except for supported reductions.

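What the comments called "impure" is called "cross iteration dependencies" in that documentation. Such a "cross iteration dependency" is a variable that changes between loop iterations. A simple example would be: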

def func(n):
    a = 0
    for i in range(n):
        a += 1
    return a

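Here the variable a depends on the value it had before the loop started and on how many loop iterations have been executed. That is what is meant by a "cross iteration dependency" or an "impure" loop.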

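The problem when explicitly parallelizing such a loop is that the iterations are performed in parallel, but each iteration needs to know what the other iterations are doing. Failing to account for that produces a wrong result.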
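One way to make the distinction concrete is to run the same loop body in two different iteration orders, since a parallel scheduler is free to pick any order. The sketch below is plain Python, not numba. `squares` and `thresholded_sums` are hypothetical helpers written for this illustration, and the second one only mirrors the structure of the dynamic_cumsum function from the question.

```python
def squares(seq, order):
    """Pure loop: iteration i only touches slot i."""
    out = [0] * len(seq)
    for i in order:
        out[i] = seq[i] ** 2
    return out


def thresholded_sums(seq, order, max_value):
    """Impure loop: `running` and `sums` carry state between iterations."""
    sums = []
    running = 0
    for i in order:
        if running > max_value:
            sums.append((i, running))
            running = 0
        running += seq[i]
    sums.append((order[-1], running))
    return sums


seq = [5, 1, 4, 2, 6]
forward = list(range(len(seq)))
backward = forward[::-1]

# The pure loop does not care about the order of iterations.
print(squares(seq, forward) == squares(seq, backward))  # True
# The impure loop does: a different order gives a different result.
print(thresholded_sums(seq, forward, 5) == thresholded_sums(seq, backward, 5))  # False
```

If a loop only gives the right answer for one specific iteration order, it has a cross iteration dependency and is not a safe candidate for prange.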

Let's assume for the moment that prange spawns 4 workers and that we pass 4 as n to the function. What would a completely naive implementation do?

Worker 1 starts, gets i = 0 from `prange`, and reads a = 0
Worker 2 starts, gets i = 1 from `prange`, and reads a = 0
Worker 3 starts, gets i = 2 from `prange`, and reads a = 0
Worker 1 executes the loop body and sets `a = a + 1` (=> 1)
Worker 3 executes the loop body and sets `a = a + 1` (=> 1)
Worker 4 starts, gets i = 3 from `prange`, and reads a = 1
Worker 2 executes the loop body and sets `a = a + 1` (=> 1)
Worker 4 executes the loop body and sets `a = a + 1` (=> 2)

=> Loop ended, function returns 2

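The order in which the different workers read, execute and write to a can be arbitrary; this was just one example. It could also (by accident) produce the correct result! This is generally called a race condition.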
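An interleaving like this can be replayed deterministically to see how updates get lost. The following is a minimal sketch in plain Python. `replay` and the two schedules are hypothetical names made up for this illustration, and nothing here reflects how numba actually schedules work.

```python
def replay(schedule, a=0):
    """Replay a fixed interleaving of non-atomic `a = a + 1` steps.

    Each step is (worker, action): "read" copies the shared value into
    the worker's private register, "write" stores register + 1 back.
    """
    registers = {}
    for worker, action in schedule:
        if action == "read":
            registers[worker] = a
        else:  # "write"
            a = registers[worker] + 1
    return a


# Workers 1-3 all read before anyone writes; worker 4 reads in between.
racy = [
    (1, "read"), (2, "read"), (3, "read"),
    (1, "write"), (3, "write"),
    (4, "read"),
    (2, "write"), (4, "write"),
]
# Every worker finishes its read and write before the next one starts.
serial = [(w, action) for w in (1, 2, 3, 4) for action in ("read", "write")]

print(replay(racy))    # 2
print(replay(serial))  # 4
```

Four increments were requested in both cases, but the racy schedule overwrites two of them because the workers acted on stale reads.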

What would a more sophisticated prange do if it recognized that there is such a cross iteration dependency?

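There are three options:

Simply don't parallelize it.

Implement a mechanism by which the workers share the variable. The typical example here is locks (which can incur a high overhead).

Recognize that it is a reduction that can be parallelized.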
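The second and third options can be sketched with the standard library. This is only an illustration of the two general strategies, not a description of numba's internals. `count_with_lock` and `count_with_reduction` are hypothetical helpers, and because of CPython's GIL neither is expected to run faster than a plain loop.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_with_lock(n, workers=4):
    """Option 2: every iteration updates one shared counter under a lock."""
    total = 0
    lock = threading.Lock()

    def body(_):
        nonlocal total
        with lock:  # serializes the read-modify-write on `total`
            total += 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(n)))
    return total


def count_with_reduction(n, workers=4):
    """Option 3: each worker counts privately, partial results are combined once."""

    def partial(chunk):
        local = 0  # private to this worker, so no synchronization is needed
        for _ in chunk:
            local += 1
        return local

    # Strided chunks that together cover range(n) exactly once.
    chunks = [range(start, n, workers) for start in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial, chunks))


print(count_with_lock(1000), count_with_reduction(1000))  # 1000 1000
```

The lock version pays a synchronization cost on every iteration, while the reduction version synchronizes only once when the partial results are summed. The reduction approach works only because addition can be regrouped freely, which is why not every cross iteration dependency can be handled this way.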

Given my understanding of the numba documentation (repeating it again):

The user is required to make sure that the loop does not have cross iteration dependencies except for supported reductions.

Numba does the following:

If it is a known reduction, it uses patterns to parallelize it

If it is not a known reduction, it throws an exception

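Unfortunately it is not clear what the "supported reductions" are. But the documentation hints that they are binary operators that operate on the previous value in the loop body: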

A reduction is inferred automatically if a variable is updated by a binary function/operator using its previous value in the loop body. The initial value of the reduction is inferred automatically for += and *= operators. For other functions/operators, the reduction variable should hold the identity value right before entering the prange loop. Reductions in this manner are supported for scalars and for arrays of arbitrary dimensions.

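The code in the OP uses a list as a cross iteration dependency and calls list.append in the loop body. Personally I wouldn't call list.append a reduction, and it does not use a binary operator, so my assumption is that it is very likely not supported. As for the other cross iteration dependency, running: it uses addition on the result of the previous iteration (which would be fine), but it also conditionally resets it to zero if it exceeds a threshold (which is probably not fine).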

Numba provides ways to inspect the intermediate code (LLVM and ASM):

dynamic_cumsum.inspect_types()
dynamic_cumsum.inspect_llvm()
dynamic_cumsum.inspect_asm()

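But even if I had the required understanding of the results to make any statement about the correctness of the emitted code: in general it is highly nontrivial to "prove" that multi-threaded or multi-process code works correctly. Given that I lack even the LLVM and ASM knowledge to see whether it tries to parallelize the loop at all, I cannot actually answer your specific question of which outcome it produces.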

Back to the code: as mentioned, it throws an exception (unsupported reduction) if I use parallel=True, so I assume that numba does not parallelize anything in the example:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i] 
    cumsum.append([index[-1], running])

    return cumsum

dynamic_cumsum(np.ones(100), np.arange(100), 10)

AssertionError: Invalid reduction format

During handling of the above exception, another exception occurred:

LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
Invalid reduction format

File "<>", line 7:
def dynamic_cumsum(seq, index, max_value):
    <source elided>
    running = 0
    for i in prange(len(seq)):
    ^

[1] During: lowering "id=2[LoopNest(index_variable = parfor_index.192, range = (0, seq_size0.189, 1))]{56: <ir.Block at <> (10)>, 24: <ir.Block at <> (7)>, 34: <ir.Block at <> (8)>}Var(parfor_index.192, <> (7))" at <> (7)

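So what is left to say: prange does not provide any speed advantage over a normal range in this case (because it is not executing in parallel). So in this case I wouldn't "risk" potential problems or confuse readers, given that it is not supported according to the numba documentation.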

from numba import njit, prange

@njit
def p_dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i] 
    cumsum.append([index[-1], running])

    return cumsum

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in range(len(seq)):  # <-- here is the only change
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i] 
    cumsum.append([index[-1], running])

    return cumsum

Just a quick timing that supports the "not faster than" statement I made earlier:

import numpy as np
seq = np.random.randint(0, 100, 10_000_000)
index = np.arange(10_000_000)
max_ = 500
# Correctness and warm-up
assert p_dynamic_cumsum(seq, index, max_) == dynamic_cumsum(seq, index, max_)
%timeit p_dynamic_cumsum(seq, index, max_)
# 468 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dynamic_cumsum(seq, index, max_)
# 470 ms ± 9.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Regarding python - How to determine if numba's prange actually works correctly?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54583152/
