
python - Disappointing pyCUDA benchmark results for N-point distance calculation


The following script was set up for benchmarking purposes. It computes the distances between N points using the Euclidean L2 norm. Three different routines are implemented:

  1. A high-level solution using the scipy.spatial.distance.pdist function.
  2. A fairly low-level, OpenMP-powered scipy.weave.inline solution.
  3. A GPGPU solution powered by pyCUDA.

Here are the benchmark results on an i5-3470 (16 GB RAM) with a GTX660 (2 GB RAM):

------------
Scipy Pdist
Execution time: 3.01975 s
First five elements: [ 0.74968684 0.71457213 0.833188 0.48084545 0.86407363]
Last five elements: [ 0.65717077 0.76850474 0.29652017 0.856179 0.56074625]
------------
Weave Inline
Execution time: 2.48705 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850474 0.29652017 0.856179 0.56074625]
------------
pyCUDA
CUDA clock timing: 0.713028930664
Execution time: 2.04364 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850468 0.29652017 0.856179 0.56074625]
------------

I am a bit disappointed by pyCUDA's performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I hitting the limit of global memory bandwidth? Is my choice of block and grid sizes poor?

import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave

def weave_solution(x):
    """
    OpenMP powered weave inline.
    """
    N,DIM = numpy.shape(x)
    # Number of entries in the condensed distance matrix.
    L = ((N-1)**2+(N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)
    ncpu = 4
    weave_omp = {'headers'            : ['<omp.h>'],
                 'extra_compile_args' : ['-fopenmp'],
                 'extra_link_args'    : ['-lgomp']}
    code = \
r'''
omp_set_num_threads(ncpu);
#pragma omp parallel
{
    int j,d,pos;
    float r=0.0;

    #pragma omp for
    for (int i=0; i<(N-1); i++){
        for (j=(i+1); j<N; j++){
            r = 0.0;
            for (d=0; d<DIM; d++){
                r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
            }
            /* Index into the condensed distance matrix. */
            pos = (i*N+j)-(i*(i+1)/2)-i-1;
            solution[pos] = sqrt(r);
        }
    }
}
'''
    weave.inline(code,['x','N','DIM','solution','ncpu'],**weave_omp)
    return numpy.array(solution)

def scipy_solution(x):
    """
    SciPy high-level function.
    """
    return pdist(x).astype(numpy.float32)

def cuda_solution(x):
    """
    pyCUDA
    """
    N,DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    # Number of entries in the condensed distance matrix.
    L = ((N-1)**2+(N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)

    start = drv.Event()
    end = drv.Event()

    # One thread per row i; each thread serially loops over all j > i.
    mod = SourceModule("""
    __global__ void distance(float *x,int N,int DIM,float *solution){

        const int i = blockDim.x * blockIdx.x + threadIdx.x;

        int j,d,pos;
        float r=0.0;

        if ( i < (N-1) ){

            for (j=(i+1); j<N; j++){

                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                // Index into the condensed distance matrix.
                pos = (i*N+j)-(i*(i+1)/2)-i-1;
                solution[pos] = sqrt(r);
            }

        }
    }
    """)

    func = mod.get_function("distance")

    start.record()
    func(drv.In(x),N,DIM,drv.Out(solution),block=(192,1,1),grid=(192,1))
    end.record()
    end.synchronize()
    secs = start.time_till(end)*1e-3

    print "CUDA clock timing: ",secs
    return solution

if __name__ == '__main__':

    # Set up data points
    N = 25000
    DIM = 3
    x = numpy.random.rand(N,DIM).astype(numpy.float32)

    print "-"*12
    # Scipy solution
    print "Scipy Pdist"
    stime = time.time()
    spsolution = scipy_solution(x)
    stime = time.time()-stime
    print "Execution time: {0:.5f} s".format(stime)
    print "First five elements:", spsolution[:5]
    print "Last five elements:", spsolution[-5:]
    print "-"*12

    # Weave solution
    print "Weave Inline"
    wtime = time.time()
    wsolution = weave_solution(x)
    wtime = time.time()-wtime
    print "Execution time: {0:.5f} s".format(wtime)
    print "First five elements:", wsolution[:5]
    print "Last five elements:", wsolution[-5:]
    print "-"*12

    # pyCUDA solution
    print "pyCUDA"
    ctime = time.time()
    csolution = cuda_solution(x)
    ctime = time.time()-ctime
    print "Execution time: {0:.5f} s".format(ctime)
    print "First five elements:", csolution[:5]
    print "Last five elements:", csolution[-5:]
    print "-"*12

EDIT:

I added the shebang line

#!/usr/bin/env python

at the top of the file and made it executable. After commenting out the weave.inline and scipy.spatial.distance.pdist computations, the NVIDIA Visual Profiler reports the following result:

[NVIDIA Visual Profiler screenshot]

Best Answer

Right now you launch 192 blocks of 192 threads, and each thread serially updates up to N-1 positions; you can easily launch more blocks/threads.

What you want to do is replace this loop, for (j=(i+1); j<N; j++), with N-1 threads that each execute just the inner loop, as sketched below.
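A minimal sketch of that restructuring (the kernel name and the 2D launch geometry are my own, not from the original answer): one thread per (i, j) pair, so the serial j loop disappears entirely:

__global__ void distance_pairs(const float *x, int N, int DIM, float *solution){

    // Launched on a 2D grid, e.g. block=(16,16,1),
    // grid=((N+15)/16, (N+15)/16), so every pair gets its own thread.
    const int j = blockDim.x * blockIdx.x + threadIdx.x;
    const int i = blockDim.y * blockIdx.y + threadIdx.y;

    if ( i < (N-1) && j > i && j < N ){
        float r = 0.0f;
        for (int d=0; d<DIM; d++){
            const float diff = x[i*DIM+d]-x[j*DIM+d];
            r += diff*diff;
        }
        // Same condensed-matrix index as the original kernel.
        const int pos = (i*N+j)-(i*(i+1)/2)-i-1;
        solution[pos] = sqrtf(r);
    }
}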

If you want to take it a step further, you could have N-1 × DIM threads each execute one statement of the inner loop, store the result to shared memory, and finally run a reduction over that. See Optimizing Parallel Reduction in CUDA.
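For reference, the linked slides build on a shared-memory tree reduction of roughly this shape (a generic sketch, not adapted to the distance kernel above; assumes blockDim.x is a power of two):

__global__ void reduce_sum(const float *in, float *out, int n){

    extern __shared__ float sdata[];  // one float per thread, sized at launch

    const unsigned int tid = threadIdx.x;
    const unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads in every step.
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1){
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}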

Look at this line:

r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);

The memory transactions and access pattern here are not uniform and coalesced. I also don't know whether nvcc will optimize your expression down to only two memory transactions instead of the four shown here, since I don't know whether pycuda passes -O3 to nvcc. To be sure, put (x[i*DIM+d]-x[j*DIM+d]) into a register variable and square it yourself.
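That is, something along these lines (a sketch of the suggested change):

for (d=0; d<DIM; d++){
    const float diff = x[i*DIM+d]-x[j*DIM+d];  // each operand loaded once
    r += diff*diff;                            // squared in registers
}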

Otherwise, you can also try putting #pragma unroll before each for loop to unroll them where possible.
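For example (note that nvcc can only fully unroll a loop whose trip count is known at compile time, so this pays off if DIM is hard-coded, e.g. via a #define, rather than passed as a kernel argument):

#pragma unroll
for (d=0; d<DIM; d++){
    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
}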

Regarding "python - Disappointing pyCUDA benchmark results for N-point distance calculation", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14093907/
