
c++ - CUDA problems with atomicAdd in nested loops on medium-sized grids (>760 x 760)

Reposted — author: 行者123, updated: 2023-11-28 04:59:55

I am getting an unknown error in my CUDA program, and it seems to be related to the atomicAdd function. I am coding on Windows with Visual Studio 2015. My calling code is as follows:

int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);

cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);

crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count);

cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);

printf("total number of threads that executed was: %d vs. %d called -> %s\n",
       count, gridSize.x*gridSize.y*blockSize.x*blockSize.y,
       (count == gridSize.x*gridSize.y*blockSize.x*blockSize.y) ? "ok" : "error");
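As an aside, the launch above is not error-checked, so a failure only surfaces at the next runtime call. A minimal sketch of checked launching (the `CUDA_CHECK` macro and `launchChecked` wrapper are hypothetical names, not from the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void crashFN(int regionWidth, int regionHeight, int* ct); // defined elsewhere

void launchChecked(dim3 gridSize, dim3 blockSize,
                   int regionWidth, int regionHeight, int* dev_count)
{
    crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count);
    // A kernel launch itself returns no status, so query it explicitly:
    CUDA_CHECK(cudaGetLastError());       // launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // execution errors, e.g. a TDR reset
}
```

This pins the "unknown error" to the kernel itself rather than to whichever later `cudaMemcpy` happens to report it.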

My global kernel function is then:

__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    shared_sum = 0;

    sumGlobal(regionWidth, regionHeight, &shared_sum);

    atomicAdd(ct, 1);
}
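As an aside, every thread in the block writes `shared_sum = 0` here, and there is no barrier before the atomics begin, so initialization races with the increments. A race-free sketch of the same kernel (an illustration, not the original code):

```cuda
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    // Let a single thread initialize the shared counter, then make the
    // write visible to the whole block before anyone increments it.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        shared_sum = 0;
    __syncthreads();

    sumGlobal(regionWidth, regionHeight, &shared_sum);

    atomicAdd(ct, 1);
}
```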

with sumGlobal defined as:

__device__
void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
{
    // sum in nested loop
    for (int y = 0; y < regionHeight; y++)
        for (int x = 0; x < regionWidth; x++)
            atomicAdd(global_sum, 1);
}

The build output of the program is as follows:

1>  H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU 
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio
14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --
machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE
-D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o
x64\Release\targetDetectionGPU.cu.obj
"H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"

It is a standard NVIDIA CUDA console project, with only the arch changed to sm_50,compute_50.

The output of my program (with debug information) is as follows:

sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA

Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version: 9.0
CUDA Runtime Version: 8.0
CUDA Capability Major revision number: 6
CUDA Capability Minor revision number: 1
Total amount of global memory: 0 bytes
Number of multiprocessors: 6
Number of cores: 288
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.39 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host
threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API
error (30): unknown error
finshed cuda algorithm

It works better with smaller grid sizes.

So when I instead choose a grid size of 764 x 764, I get:

Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called ->
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API
error (30): unknown error

With 750 x 750 the error disappears; with 760 x 760 it is back.

The device specification allows much larger grid sizes than 765, so am I missing something here? I don't see why a simple atomicAdd in a nested loop should cause these errors. Is this a bug?

OK, I have now simplified the kernel, removing the function call and merging the loops into one, but on larger grid sizes the error is still there; if I comment out the loop it runs fine:

__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;

    shared_sum = 0;
    __syncthreads();

    for (int y = 0; y < regionHeight*regionWidth; y++)
        atomicAdd(&shared_sum, 1);

    __syncthreads();

    atomicAdd(ct, 1);
}

If I shorten the loop to

for (int y = 0; y < regionHeight; y++)
    atomicAdd(&shared_sum, 1);

then it works fine. It seems to be a timeout issue, which is strange because I set the WDDM TDR timeout to 10 seconds with the NSight Monitor.

Best Answer

If you get "error (30): unknown error", suspect a TDR timeout, especially on Windows. Basically, my test program was spending too long in the loop and triggering the timeout: with a 765 x 765 grid of 32 x 32 blocks, each thread performing 32 x 32 = 1024 atomicAdds, the kernel issues roughly 765 x 765 x 1024 x 1024 ≈ 6 x 10^11 atomic operations. This is especially likely when you are debugging with printf statements!

The solution is to increase the timeout by changing the TDR setting to around 30 seconds; raising it is not a problem when the GPU card is not driving your main display. With a higher TDR value it is also easier to tell that it is your program taking too long, rather than something else. Then try to improve the code by removing loops, especially ones containing atomic operations, or restructure it to use techniques such as reduction.
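For reference, the TDR delay can also be raised directly in the registry instead of through NSight Monitor. A sketch of a .reg file (key and value names as documented by Microsoft for TDR; the value is in seconds, 0x1e = 30, and a reboot is required):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000001e
```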

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
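To illustrate the restructuring suggestion, a minimal shared-memory tree reduction in the spirit of the linked slides (a sketch, not the author's code) replaces the 1024 serialized atomicAdds per block with about log2(1024) = 10 synchronized steps and a single global atomic per block:

```cuda
__global__
void countThreads(int* ct)
{
    // One slot per thread of a 32 x 32 block.
    __shared__ int partial[1024];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    partial[tid] = 1;            // each thread contributes once
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = 512; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per thread.
    if (tid == 0)
        atomicAdd(ct, partial[0]);
}
```

Each block then adds its full count of 1024 in a single atomicAdd, so contention on the global counter drops by three orders of magnitude.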

Regarding "c++ - CUDA problems with atomicAdd in nested loops on medium-sized grids (>760 x 760)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46321376/
