python - 如何加速即使使用 Numba 也很慢的计算-6ren

python - 如何加速即使使用 Numba 也很慢的计算

转载作者：行者123 更新时间：2023-12-04 14:50:07

我的 Python 代码计算缓慢，遇到了麻烦。根据下面的 pycallgraph，瓶颈似乎是名为 miepython.miepython.mie_S1_S2 的模块(以粉红色突出显示)，每次调用需要 0.47 秒。

该模块源码如下:

import numpy as np
from numba import njit, int32, float64, complex128

__all__ = ('ez_mie',
           'ez_intensities',
           'generate_mie_costheta',
           'i_par',
           'i_per',
           'i_unpolarized',
           'mie',
           'mie_S1_S2',
           'mie_cdf',
           'mie_mu_with_uniform_cdf',
           )


@njit((complex128, float64, float64[:]), cache=True)
def _mie_S1_S2(m, x, mu):
    """
    Calculate the scattering amplitude functions for spheres.

    The amplitude functions have been normalized so that when integrated
    over all 4*pi solid angles, the integral will be qext*pi*x**2.

    The units are weird, sr**(-0.5)

    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere
        mu: array of angles, cos(theta), to calculate scattering amplitudes

    Returns:
        S1, S2: the scattering amplitudes at each angle mu [sr**(-0.5)]
    """
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)

    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)

    nstop = len(a)
    for k in range(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1) * pi_nm2

            S1[k] += (2 * n + 1) * (pi_nm1 * a[n - 1]
                                    + tau_nm1 * b[n - 1]) / (n + 1) / n

            S2[k] += (2 * n + 1) * (tau_nm1 * a[n - 1]
                                    + pi_nm1 * b[n - 1]) / (n + 1) / n

            temp = pi_nm1
            pi_nm1 = ((2 * n + 1) * mu[k] * pi_nm1 - (n + 1) * pi_nm2) / n
            pi_nm2 = temp

    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))

    S1 /= norm
    S2 /= norm

    return [S1, S2]

显然，源代码是由 Numba 编译的，因此它应该比实际速度更快。此函数中for循环的迭代次数约为25,000 (len(mu)=50, len(a)-1=500 ).

关于如何加速这个计算有什么想法吗？是否有什么东西阻碍了 Numba 的快速计算？或者，您认为计算已经足够快了吗？

[更多详情]

在上面，使用了另一个函数_mie_An_Bn。这个函数也是jitted的，源码如下:

@njit((complex128, float64, complex128[:], complex128[:]), cache=True)
def _mie_An_Bn(m, x, a, b):
    """
    Compute arrays of Mie coefficients A and B for a sphere.

    This estimates the size of the arrays based on Wiscombe's formula. The length
    of the arrays is chosen so that the error when the series are summed is
    around 1e-6.

    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere

    Returns:
        An, Bn: arrays of Mie coefficents
    """
    psi_nm1 = np.sin(x)                   # nm1 = n-1 = 0
    psi_n = psi_nm1 / x - np.cos(x)       # n = 1
    xi_nm1 = complex(psi_nm1, np.cos(x))
    xi_n = complex(psi_n, np.cos(x) / x + np.sin(x))

    nstop = len(a)
    if m.real > 0.0:
        D = _D_calc(m, x, nstop + 1)

        for n in range(1, nstop):
            temp = D[n] / m + n / x
            a[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            temp = D[n] * m + n / x
            b[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real

    else:
        for n in range(1, nstop):
            a[n - 1] = (n * psi_n / x - psi_nm1) / (n * xi_n / x - xi_nm1)
            b[n - 1] = psi_n / xi_n
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real

示例输入如下:

m = 1.336-2.462e-09j
x = 8526.95
mu = np.array([-1., -0.7500396, 0.46037385, 0.5988121, 0.67384093, 0.72468684, 0.76421644, 0.79175856, 0.81723714, 0.83962897, 0.85924182, 0.87641596, 0.89383665, 0.90708978, 0.91931481, 0.93067567, 0.94073113, 0.94961222, 0.95689496, 0.96467123,  0.97138347, 0.97791831, 0.98339434, 0.98870543, 0.99414948, 0.9975728   0.9989995, 0.9989995, 0.9989995, 0.9989995, 0.9989995,0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899952,  0.99899952,
  0.99899952,  0.99899952,  0.99899952,  0.99899952,  0.99899952, 0.99899952, 0.99899952,  1.        ])

最佳答案

我关注 _mie_S1_S2，因为它似乎是提供的示例数据集上最昂贵的函数。

首先，如果没有+Inf、-Inf、-0这样的值，你可以给JIT使用参数fastmath=True来加速计算或计算出的 NaN。

然后您可以预先计算一些包含除法或隐式整数到 float 转换的昂贵表达式。请注意 (2 * n + 1)/n = 2 + 1/n 和 (n + 1)/n = 1 + 1/n。这对于减少预计算数组的数量很有用，但不会改变我机器上的性能(这可能会因目标架构而改变)。另请注意，这样的预计算对结果准确性有轻微影响(大部分时间可以忽略不计，有时比引用实现更好)。

在我的机器上，此策略使代码在使用 fastmath=True 时快 4.5 倍，在没有时快 2.8 倍。

基于 k 的循环可以使用 Numba 的 parallel=True 和 prange 进行并行化。然而，这可能并不总是在所有机器上都更快(尤其是那些有很多内核的机器)，因为循环非常快。

这是最终代码:

@njit((complex128, float64, float64[:]), cache=True, parallel=True)
def _mie_S1_S2_opt(m, x, mu):
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)

    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)

    factor1 = np.empty(nstop, dtype=np.float64)
    factor2 = np.empty(nstop, dtype=np.float64)
    factor3 = np.empty(nstop, dtype=np.float64)

    for n in range(1, nstop):
        factor1[n - 1] = (2 * n + 1) / (n + 1) / n
        factor2[n - 1] = (2 * n + 1) / n
        factor3[n - 1] = (n + 1) / n

    nstop = len(a)
    for k in nb.prange(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            i = n - 1
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1.0) * pi_nm2

            S1[k] += factor1[i] * (pi_nm1 * a[i] + tau_nm1 * b[i])
            S2[k] += factor1[i] * (tau_nm1 * a[i] + pi_nm1 * b[i])

            temp = pi_nm1
            pi_nm1 = factor2[i] * mu[k] * pi_nm1 - factor3[i] * pi_nm2
            pi_nm2 = temp

    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))

    S1 /= norm
    S2 /= norm

    return [S1, S2]

%timeit -n 1000 _mie_S1_S2_opt(m, x, mu)

在我的 6 核机器上，最终优化的实现在使用 fastmath=True 时 快 12 倍，在没有时快 8.8 倍。请注意，在其他函数中使用类似的策略也可能有助于加快它们的速度。

关于python - 如何加速即使使用 Numba 也很慢的计算，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/69244843/

文章推荐： clojure - 如何使用 babashka 过滤 tail 的输出

文章推荐： sql - 从一列的多行创建一个逗号分隔的字符串？

文章推荐： .net - MVC3 默认模板在哪里存储用户帐户信息？

numba - 在 Numba 优化的 Python 中将类对象作为函数参数传递
我想将一个类对象传递给一个函数。我可以让它工作，但我想知道是否有我可以分配的类型？我有一个我正在尝试做的“最小”示例。 spec = [("a", float64),("b",float64)] @j
python - numba - 打字错误 : cannot determine Numba type of
我有一个简单的函数来对扑克手牌进行排序(手牌是字符串)。我用 rA,rB = rank(a),rank(b) 调用它，这是我的实现。没有 @jit(nopython=True) 也能很好地工作，但是
python - numpy 比 numba 和 cython 快，如何改进 numba 代码
我在这里有一个简单的例子来帮助我理解使用 numba 和 cython。我是 numba 和 cython 的新手。我已经尽力结合所有技巧来使 numba 更快，并且在某种程度上，cython 也是如
python - 如何使 numba @jit 使用所有 cpu 内核(并行化 numba @jit)
我正在使用 numbas @jit 装饰器在 python 中添加两个 numpy 数组。如果我使用 @jit 与 python 相比，性能是如此之高。然而，即使我传入 @numba.jit(nop
python - Numba jit nopython 模式 : tell numba the signature of an external arbitrary function
我需要为通用指标构建相异矩阵。由于我需要算法快速运行，所以我在 nopython 模式下使用了 numba 0.35。这是我的代码 import numpy as np from numba impo
python - Numba 支持 cuda 协作 block 同步？？ Python numba cuda 网格同步
Numba Cuda 有 syncthreads() 来同步一个 block 中的所有线程。如何在不退出当前内核的情况下同步网格中的所有 block ？在 C-Cuda 中有一个 cooperati
numba - 如何在协作室中使用numba
有人尝试在Google合作伙伴中使用numba吗？我只是不知道如何在此环境中进行设置。此刻，我陷入了错误library nvvm not found。最佳答案将此代码复制到单元格中。这个对我有用
python - Numba:回退到对象模式时抑制错误
我想编写一个函数，它既可以作为 jitted 函数运行，也可以作为普通 python 或对象模式 numba 运行，具体取决于 numba 是否能够进行类型推断。我实际上更喜欢普通的 python，但
list - Numba 从列表创建元组
我有一个非常简单的问题我无法解决。我正在使用 Numba 和 Cuda。我有一个列表 T=[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0] 我想要一个包含列表元素的元组，如
python - Numba 没有提高性能
我正在测试一些采用 numpy 数组的函数的 numba 性能，并比较: import numpy as np from numba import jit, vectorize, float64 im
python - 插值 - Numba
我正在使用 Scipy 的 interpolate.interp1d 在 Python3 中插入一维数组。我想将它与 numba 一起使用，但不支持 scipy 和此功能。是否有 numba 支持
python - Numba 并行代码比顺序代码慢
我是 Numba 的新手，我正在尝试使用 Numba(版本 0.54.1)在 Python 中实现旧的 Fortran 代码，但是当我添加 parallel = True 时，程序实际上变慢了.我的程
python - Numba 的最佳可能位数组
我需要在 Python 中创建一个位数组。到目前为止，我发现可以使用 bitarray 生成非常节省内存的数组。模块。然而，我的最终目的是使用来自Numba 的@vectorize 装饰器。 . N
python - Numba - 字符串类型
我认为这是一个简单的问题，但我发现 numba 文档缺乏关于如何将字符串类型与 numpy 数组和字典一起使用的信息。我有一个我想使用 numba 的函数，它需要一个邮政编码列表，然后是一个映射邮政编
python - 如何在多个功能中最佳地使用 numba？
假设我有两个功能 def my_sub1(a): return a + 2 def my_main(a): a += 1 b = mysub1(a) return b
python - numba 编译逻辑比较中的性能损失
在以下用于逻辑比较的 numba 编译函数中，性能下降的原因可能是什么: from numba import njit t = (True, 'and_', False) #@njit(boolean
python - Numba 中的笛卡尔积
我的代码使用如下列表的笛卡尔积: import itertools cartesian_product = itertools.product(list('ABCDEF'), repeat=n) n可
gpu - Numba 中的组合矢量化函数
我正在使用 Numba(版本 0.37.0)来优化 GPU 代码。我想使用组合矢量化函数(使用 Numba 的 @vectorize 装饰器)。导入和数据: import numpy as np f
python - numba 中两个列表的交集
我想知道在 numba 函数中计算两个列表的交集的最快方法。只是为了澄清:两个列表的交集示例: Input : lst1 = [15, 9, 10, 56, 23, 78, 5, 4, 9] lst2
python - Numba 函数与类型参数的使用无效
我正在使用 Numba 非 python 模式和一些 NumPy 函数。 @njit def invert(W, copy=True): ''' Inverts elementwise

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python - 如何加速即使使用 Numba 也很慢的计算