
python - Does broadcasting outside the main loop speed up vectorized numpy operations?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:37:04

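I am doing some vectorized algebra with numpy, and the wall-clock performance of my algorithm seems strange. The program roughly does the following:

  1. Create three matrices: Y (KxD), X (NxD), T (KxN).
  2. For each row Y[i] of Y:
  3. subtract Y[i] from each row of X (by broadcasting),
  4. square the differences, sum along one axis, take the square root, and store the result in T[i].

However, the computation speed differs a lot depending on how I perform the broadcast. Consider the code: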

import numpy as np
from time import perf_counter

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

if True:  # negate to enable the second loop
    time = 0.0
    for _ in range(100):
        start = perf_counter()
        for i in range(K):
            T[i] = np.sqrt(np.sum(
                np.square(
                    X - Y[i]  # this has dimensions NxD
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast in line: {:.3f} s".format(time / 100))
    exit()

if True:
    time = 0.0
    for _ in range(100):
        start = perf_counter()
        for i in range(K):
            diff = X - Y[i]
            T[i] = np.sqrt(np.sum(
                np.square(
                    diff
                ),
                axis=1
            ))
        time += perf_counter() - start
    print("Broadcast out: {:.3f} s".format(time / 100))
    exit()

Each loop is timed separately, and the reported time is the average over 100 executions. The results:

Broadcast in line: 1.504 s
Broadcast out: 0.438 s

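The only difference is that in the first loop the broadcast and subtraction are done inline, whereas in the second approach I perform them before any of the other vectorized operations and store the result in a variable. Why does this make such a big difference?

My system configuration:

  • Lenovo ThinkStation P920, 2x Xeon Silver 4110, 64 GB RAM
  • Xubuntu 18.04.2 LTS (bionic)
  • Python 3.7.3 (GCC 7.3.0)
  • Numpy 1.16.3 linked against OpenBLAS (according to np.__config__.show())

PS: Yes, I know this can be optimized further, but right now I want to understand what is happening here.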

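Best answer

This is not a broadcasting problem

I also added an optimized solution to see how long the actual computation takes without the overhead of a large number of memory allocations and deallocations.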
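To illustrate what "without allocation overhead" could look like while staying in plain NumPy, here is a minimal sketch that reuses two scratch buffers through the `out=` argument of the ufuncs instead of creating new temporaries on every iteration. The function name `pairwise_dist_prealloc` and the small array sizes are assumptions made for this example, not part of the original answer, and whether it is actually faster on a given machine would have to be measured.

```python
import numpy as np

def pairwise_dist_prealloc(X, Y, T):
    """Fill T[i, j] with the Euclidean distance between Y[i] and X[j].

    Hypothetical helper: the scratch buffers are allocated once and
    reused in every iteration via the `out=` argument.
    """
    diff = np.empty_like(X)        # (N, D) scratch buffer, allocated once
    sq_sum = np.empty(X.shape[0])  # (N,) scratch buffer, allocated once
    for i in range(Y.shape[0]):
        np.subtract(X, Y[i], out=diff)    # broadcast Y[i] over the rows of X
        np.square(diff, out=diff)         # square in place
        np.sum(diff, axis=1, out=sq_sum)  # sum over the D axis
        np.sqrt(sq_sum, out=T[i])         # write straight into row i of T
    return T

# Small illustrative sizes (much smaller than in the question)
np.random.seed(0)
X = np.random.rand(30, 8)
Y = np.random.rand(5, 8)
T = np.zeros((5, 30))
pairwise_dist_prealloc(X, Y, T)

# Reference result computed with a single broadcasted expression
reference = np.linalg.norm(X[None, :, :] - Y[:, None, :], axis=2)
```

The loop body performs the same four steps as the original functions; only the destination of each intermediate result changes.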

Functions

import numpy as np
import numba as nb

# Broadcast and subtraction done inline
def func_1(X, Y, T):
    for i in range(Y.shape[0]):  # Y.shape[0] == K; avoids relying on a global
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

# Difference stored in a variable first
def func_2(X, Y, T):
    for i in range(Y.shape[0]):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

# Explicit loops compiled with Numba; no temporary arrays are created
@nb.njit(fastmath=True, parallel=True)
def func_3(X, Y, T):
    for i in nb.prange(Y.shape[0]):
        for j in range(X.shape[0]):
            diff_sq_sum = 0.
            for k in range(X.shape[1]):
                diff_sq_sum += (X[j, k] - Y[i, k])**2
            T[i, j] = np.sqrt(diff_sq_sum)
    return T

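Timings

I did all timings in a Jupyter Notebook and observed a very strange behavior. The following code is in a single cell. I also tried calling %timeit more than twice per function, but during the first execution of the cell this did not change anything.

First execution of the cell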

D = 128
N = 3000
K = 500

X = np.random.rand(N, D)
Y = np.random.rand(K, D)
T = np.zeros((K, N))

# Repeating this more often does not change anything
%timeit func_1(X,Y,T)
%timeit func_1(X,Y,T)

# Repeating this more often does not change anything
%timeit func_2(X,Y,T)
%timeit func_2(X,Y,T)

# Warm-up call, so that the next measurement avoids compilation overhead
%timeit func_3(X,Y,T)
# Actual measurement
%timeit func_3(X,Y,T)

774 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
768 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
494 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.7 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.74 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Second execution of the cell

345 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
337 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
322 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.93 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
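For readers who want to check outside a notebook whether the gap between the two NumPy variants depends on warm-up or on the order of the runs, the following is a rough sketch using the standard `timeit` module. The helper `bench`, the function names `inline` and `hoisted`, and the reduced array sizes are assumptions made for this example; no timings are shown because they depend on the machine.

```python
import timeit
import numpy as np

def inline(X, Y, T):
    # Subtraction written inside the expression (like func_1)
    for i in range(Y.shape[0]):
        T[i] = np.sqrt(np.sum(np.square(X - Y[i]), axis=1))
    return T

def hoisted(X, Y, T):
    # Subtraction stored in a variable first (like func_2)
    for i in range(Y.shape[0]):
        diff = X - Y[i]
        T[i] = np.sqrt(np.sum(np.square(diff), axis=1))
    return T

def bench(func, *args, warmup=1, repeat=5):
    """Hypothetical helper: run `warmup` untimed calls, then return the
    best of `repeat` timed calls in seconds."""
    for _ in range(warmup):
        func(*args)
    return min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))

# Reduced sizes so the sketch runs quickly; the post uses N=3000, K=500, D=128
np.random.seed(0)
X = np.random.rand(300, 16)
Y = np.random.rand(50, 16)
T = np.zeros((50, 300))

# Alternate the two variants to see whether the order of execution matters
results = []
for func in (inline, hoisted, inline, hoisted):
    results.append((func.__name__, bench(func, X, Y, T)))

for name, seconds in results:
    print("{:8s} {:.6f} s".format(name, seconds))
```

If the difference shrinks or disappears once both variants have been run a few times, that would be consistent with the first-run versus second-run timings reported above.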

Regarding "python - Does broadcasting outside the main loop speed up vectorized numpy operations?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57312092/
