python - Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?


Reposted by: 行者123. Updated: 2023-12-03 16:31:11

Question:
I am trying to improve the speed of an aerodynamics function in Python.
The set of functions:

import numpy as np
from numba import njit

def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):

    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)
    
    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations
    
    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    
    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16
    
    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    
    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)
    
    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )
    
        # Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)
    
        induced_velocities = k * r_1_cross_r_2
    
    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0

    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)

    return induced_velocities


@njit    
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )


@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e

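Context:
This function is used by Ptera Software, an open-source solver for flapping-wing aerodynamics. Profiling shows that it is by far the largest contributor to Ptera Software's run time.

Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this under 1 minute.
The function takes a group of points, origins, terminations, and strengths. At every point, it finds the velocity induced by the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, the output is the cumulative velocity induced at each point by all the vortices. If it is false, the function outputs each vortex's contribution to the velocity at each point.
During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve relatively small inputs (around 200 points, origins, terminations, and strengths). Later calls involve large inputs (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for all input sizes, but speeding up the large-input calls is more important.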
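For reference, the quantity computed for each point/vortex pair can be written as below. This is read directly off the code above, not taken from the Ptera Software documentation. The symbol $\Gamma$ stands for the entry of `strengths` for that vortex, $\mathbf{r}_1$ and $\mathbf{r}_2$ are the vectors from the vortex origin and termination to the point, and $\mathbf{r}_0 = \mathbf{r}_1 - \mathbf{r}_2$. Note that the variable named `r_1_cross_r_2_absolute_magnitude` holds the squared magnitude $\lvert \mathbf{r}_1 \times \mathbf{r}_2 \rvert^{2}$.

```latex
\mathbf{v} = \frac{\Gamma}{4\pi}\,
\frac{\mathbf{r}_1 \times \mathbf{r}_2}{\lvert \mathbf{r}_1 \times \mathbf{r}_2 \rvert^{2}}
\left(
  \frac{\mathbf{r}_0 \cdot \mathbf{r}_1}{\lvert \mathbf{r}_1 \rvert}
  - \frac{\mathbf{r}_0 \cdot \mathbf{r}_2}{\lvert \mathbf{r}_2 \rvert}
\right)
```

With collapse set to true, these per-vortex contributions are summed over all vortices for each point.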
For testing, I recommend running the following script with your own implementation of the function:

import timeit

import matplotlib.pyplot as plt
import numpy as np

n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3

times_py = []

for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)

    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np

these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')

def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass
    '''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
    '''
    
    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times)/n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")

    # Record the times.
    times_py.append(time_py)

sizes = [10 ** i for i in range(max_oom - min_oom + 1)]

fig, ax = plt.subplots()

ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()

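Previous attempts:
My previous attempts at speeding up this function involved vectorizing it (which worked well, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I applied Numba to a modified version of the entire velocity function, the result was much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.
Updates:
Update 1:
Based on Mercury's comment (which has since been deleted), I replaced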

points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations

with two calls to the following function:

@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

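This reduced the run time from 227 s to 220 s. That is better! However, it is still not fast enough.
I also tried setting the njit fastmath flag to true and using a Numba function instead of calling np.einsum. Neither increased the speed.
Update 2:
With Jérôme Richard's answer, the run time is now 156 s, a decrease of 29%! I am happy to accept this answer, but feel free to make other suggestions if you think you can improve on their work!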

Best answer

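First of all, Numba can perform parallel computations, resulting in faster code, if you request it explicitly, mainly with parallel=True and prange. This is useful for big arrays (but not for small ones).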
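As a minimal sketch of that pattern, the snippet below parallelizes a loop over rows with `prange`. The helper `row_norms` and the sample data are hypothetical and only illustrate the mechanism. The `try`/`except` fallback is an assumption added so the sketch still runs (serially) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: fall back to plain Python if Numba is absent
    prange = range

    def njit(*args, **kwargs):
        # Support both the bare @njit form and @njit(parallel=True).
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def row_norms(vectors):
    # Hypothetical helper: Euclidean norm of each row of an (N x 3) array.
    out = np.empty(vectors.shape[0])
    # With Numba, iterations of a prange loop are distributed across threads.
    for i in prange(vectors.shape[0]):
        out[i] = np.sqrt(
            vectors[i, 0] ** 2 + vectors[i, 1] ** 2 + vectors[i, 2] ** 2
        )
    return out


vectors = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
norms = row_norms(vectors)
```

Each iteration writes to its own element of `out`, so the iterations are independent, which is what makes the loop safe to run in parallel.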
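Moreover, your computation is mainly memory-bound. You should therefore avoid creating big arrays when they are not reused multiple times or, more generally, when they can be recomputed on the fly relatively cheaply. This is the case for r_0, for example.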
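One way to see that r_0 never needs to be stored is the identity r_0 · r_1 = r_1 · r_1 - r_1 · r_2 (and similarly for r_2), which follows from r_0 = r_1 - r_2. The NumPy sketch below is an illustration of that identity and not the approach used in the implementation further down, which recomputes the components of r_0 inside a loop. The array sizes and random data are arbitrary. One caveat: the subtraction can lose precision through cancellation when r_1 and r_2 are nearly equal.

```python
import numpy as np

rng = np.random.default_rng(0)
r_1 = rng.random((4, 5, 3))
r_2 = rng.random((4, 5, 3))

# Reference: materialize r_0 as an extra (N x M x 3) array, as in the question.
r_0 = r_1 - r_2
ref_dot_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
ref_dot_2 = np.einsum("ijk,ijk->ij", r_0, r_2)

# Alternative: use only (N x M) intermediates and never build r_0.
r_1_dot_r_1 = np.einsum("ijk,ijk->ij", r_1, r_1)
r_2_dot_r_2 = np.einsum("ijk,ijk->ij", r_2, r_2)
r_1_dot_r_2 = np.einsum("ijk,ijk->ij", r_1, r_2)
r_0_dot_r_1 = r_1_dot_r_1 - r_1_dot_r_2
r_0_dot_r_2 = r_1_dot_r_2 - r_2_dot_r_2
```

The squared lengths `r_1_dot_r_1` and `r_2_dot_r_2` could also be reused for the norms, which would save further passes over the large arrays.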
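In addition, the memory access pattern matters: vectorization is more efficient when accesses are contiguous in memory, and the cache/RAM is used more efficiently. Consequently, arr[0, :, :] = 0 should be faster than arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be slower than arr[:, :, 0:2] = 0, since the former performs two non-contiguous passes over memory while the latter performs only one, more contiguous pass. Sometimes it can be beneficial to transpose your data so that the subsequent calculations are much faster.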
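The difference between the two slices can be inspected directly through NumPy's `flags` and `strides` attributes. This is a small sketch with an arbitrary (4 x 5 x 3) array of float64 values in the default C order. It shows the layout only and does not measure timing.

```python
import numpy as np

arr = np.zeros((4, 5, 3))  # C order: the last axis varies fastest in memory

first_axis_slice = arr[0, :, :]  # one contiguous block of 5 * 3 values
last_axis_slice = arr[:, :, 0]   # every third value, spread over the buffer

print(first_axis_slice.flags["C_CONTIGUOUS"])  # True
print(last_axis_slice.flags["C_CONTIGUOUS"])   # False

# Strides are in bytes: stepping along the last axis of arr moves 8 bytes,
# while stepping along either axis of last_axis_slice skips whole vectors.
print(arr.strides, last_axis_slice.strides)
```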
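Furthermore, NumPy tends to create many temporary arrays, which are costly to allocate. This is a huge problem when the input arrays are small. The Numba JIT can avoid this in most cases.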
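To make the temporaries visible, the sketch below computes the squared magnitude of the cross product twice: once as a plain expression, and once with the `out=` argument of NumPy ufuncs so that two preallocated buffers are reused. This is a NumPy-only illustration of the cost being described, not the Numba approach recommended here. The names `squared` and `scratch` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
cross = rng.random((4, 5, 3))

# Plain expression: each ** and each + allocates a new (N x M) temporary.
squared_plain = cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2

# Same result with one output buffer and one scratch buffer, both reused.
squared = np.empty(cross.shape[:2])
scratch = np.empty(cross.shape[:2])
np.multiply(cross[:, :, 0], cross[:, :, 0], out=squared)
for axis in (1, 2):
    np.multiply(cross[:, :, axis], cross[:, :, axis], out=scratch)
    np.add(squared, scratch, out=squared)
```

In a function called thousands of times, such buffers could in principle be allocated once and passed in, although an explicit Numba loop removes the intermediate arrays altogether.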
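Finally, regarding your computation, it may be a good idea to use a GPU for big arrays (definitely not for small ones). You can take a look at CuPy or ClPy to do this fairly easily.
Here is an optimized implementation that runs on the CPU: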

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in prange(c.shape[0]):
        for j in range(c.shape[1]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

@njit(parallel=True)
def nb_2d_explicit_norm(vectors):
    res = np.empty((vectors.shape[0], vectors.shape[1]))
    for i in prange(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
    return res

# NOTE: better memory access pattern
@njit(parallel=True)
def nb_2d_explicit_cross(a, b):
    e = np.empty(a.shape)
    for i in prange(e.shape[0]):
        for j in range(e.shape[1]):
            e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
            e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
            e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
    return e

# NOTE: avoid the slow building of temporary arrays
@njit(parallel=True)
def cross_absolute_magnitude(cross):
    return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2

# NOTE: avoid the slow building of temporary arrays again and multiple pass in memory
# Warning: do the work in-place
@njit(parallel=True)
def discard_singularities(arr):
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(3):
                if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
                    arr[i, j, k] = 0.0

@njit(parallel=True)
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
    return (strengths
        / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
        * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
    )

@njit(parallel=True)
def rDotProducts(b, c):
    assert b.shape == c.shape and b.shape[2] == 3
    n, m = b.shape[0], b.shape[1]
    ab = np.empty((n, m))
    ac = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            ab[i, j] = 0.0
            ac[i, j] = 0.0
            for k in range(3):
                a = b[i, j, k] - c[i, j, k]
                ab[i, j] += a * b[i, j, k]
                ac[i, j] += a * c[i, j, k]
    return (ab, ac)

# Compute `np.sum(arr, axis=1)` in parallel.
@njit(parallel=True)
def collapseArr(arr):
    assert arr.shape[2] == 3
    n, m = arr.shape[0], arr.shape[1]
    res = np.empty((n, 3))
    for i in prange(n):
        res[i, 0] = np.sum(arr[i, :, 0])
        res[i, 1] = np.sum(arr[i, :, 1])
        res[i, 2] = np.sum(arr[i, :, 2])
    return res

def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
    r_1 = subtract(points, origins)
    r_2 = subtract(points, terminations)
    # NOTE: r_0 is computed on the fly by rDotProducts

    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)

    r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)

    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)

    radius = 3.0e-16
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0

    r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)

    with np.errstate(divide="ignore", invalid="ignore"):
        k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2

    discard_singularities(induced_velocities)

    if collapse:
        induced_velocities = collapseArr(induced_velocities)

    return induced_velocities

On my machine, this code is 2.5 times faster than the initial implementation on arrays of size 10**3. It also uses a bit less memory.

Regarding "python - Can I speed up this aerodynamics calculation with Numba, vectorization, or multiprocessing?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66750661/
