
c++ - No GPU events when profiling with nvprof


I am running nvprof.exe on a function that initializes data, calls three kernels, and frees the data. Everything is profiled as expected, and I get results like this:

==7956== Profiling application: .\a.exe
==7956== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*)
43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*)
4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*)
API calls: 86.77% 238.31ms 9 26.479ms 14.600us 210.39ms cudaMallocManaged
12.24% 33.621ms 1 33.621ms 33.621ms 33.621ms cuDevicePrimaryCtxRelease
0.27% 730.80us 3 243.60us 242.10us 245.60us cudaLaunchKernel
0.15% 406.90us 3 135.63us 65.400us 170.80us cudaDeviceSynchronize
0.08% 229.70us 97 2.3680us 100ns 112.10us cuDeviceGetAttribute
0.08% 206.60us 1 206.60us 206.60us 206.60us cuModuleUnload
0.01% 19.700us 1 19.700us 19.700us 19.700us cuDeviceTotalMem
0.00% 6.8000us 1 6.8000us 6.8000us 6.8000us cuDeviceGetPCIBusId
0.00% 1.9000us 2 950ns 400ns 1.5000us cuDeviceGet
0.00% 1.8000us 3 600ns 400ns 800ns cuDeviceGetCount
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetUuid
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetLuid

==7956== Unified Memory profiling result:
Device "GeForce RTX 2060 SUPER (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
18 20.000KB 8.0000KB 32.000KB 360.0000KB 300.7000us Host To Device
24 20.000KB 8.0000KB 32.000KB 480.0000KB 2.647400ms Device To Host

As you can see, there are three kernels under GPU activities. Here is the source code:

void add_elem(int n) {
    float *a, *b, *c1, *c2, *c3;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    cudaMallocManaged(&c2, n * n * sizeof(float));
    cudaMallocManaged(&c3, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
        c2[i] = 0.0f;
        c3[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;
    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c2);
    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c3);

    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
    cudaFree(c2);
    cudaFree(c3);
}

After that, I extracted the data initialization, kernel calls, and data freeing into separate host functions and ran nvprof again. This time I only got information about the API calls, like this:

==18460== Profiling application: .\a.exe
==18460== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
API calls: 81.86% 158.78ms 9 17.643ms 1.4000us 158.76ms cudaMallocManaged
0.17% 322.80us 97 3.3270us 100ns 158.00us cuDeviceGetAttribute
0.11% 214.50us 1 214.50us 214.50us 214.50us cuModuleUnload
0.04% 68.600us 3 22.866us 7.3000us 39.400us cudaDeviceSynchronize
0.01% 12.100us 9 1.3440us 400ns 7.9000us cudaFree
0.00% 7.7000us 1 7.7000us 7.7000us 7.7000us cuDeviceGetPCIBusId
0.00% 2.1000us 3 700ns 300ns 1.0000us cuDeviceGetCount
0.00% 2.0000us 2 1.0000us 300ns 1.7000us cuDeviceGet
0.00% 1.2000us 3 400ns 300ns 500ns cudaLaunchKernel
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetLuid

As you can see, there is also no Unified Memory profiling result section, so I tried running nvprof as nvprof.exe --unified-memory-profiling off .\a.exe, but got the same result. Here is the source code:

void add_elem(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;
    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);

    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

void add_row(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c1, n);

    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}
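
The kernel bodies are not included in the question. For reference, a row-wise addition kernel consistent with the four-argument call used here (one thread per row, since numBlocks is derived from n rather than n*n) might look like the sketch below; this is purely an assumption, not the asker's actual code.

// Assumed sketch only -- the question does not show the kernel definitions.
// One thread processes one full row, matching numBlocks = (n + blockSize - 1) / blockSize.
__global__ void th_single_row_add(float *a, float *b, float *c, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {                          // the grid is rounded up, so guard the tail
        for (int col = 0; col < n; col++) {
            int idx = row * n + col;        // row-major index into the n x n matrices
            c[idx] = a[idx] + b[idx];
        }
    }
}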

void add_col(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));

    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c1, n);

    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

Update: I found the problem. I was running the code with arrays of 10000000000 elements, and it seems the kernels were never even launched: running them with 10000000 (10^8) elements took almost 3 seconds to finish, while with 10000000000 (10^10) elements it finished instantly. Yet there were no errors at all.

How should I catch such cases?

Best Answer

The cause here was launching the kernels with an unsupported <<<numBlocks, blockSize>>> configuration. After adding gpuErrchk( cudaPeekAtLastError() ); after each kernel call, I got GPUassert: invalid configuration argument, which means my GPU cannot support these numBlocks and blockSize values. Without error checking, the program simply ends silently. As Robert Crovella suggested in the comments, here is the link to proper error handling:

proper CUDA error checking

Also, running cuda-memcheck helps.
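
For completeness, this is roughly the error-checking idiom the answer refers to (the gpuErrchk/gpuAssert macro from the linked "proper CUDA error checking" answer); the usage comments at the end apply it to a launch from the question and are illustrative:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call and check the returned status.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Usage after a kernel launch (kernel name as in the question):
// th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
// gpuErrchk(cudaPeekAtLastError());      // catches launch-configuration errors
// gpuErrchk(cudaDeviceSynchronize());    // catches errors raised during kernel execution

Checking cudaPeekAtLastError() immediately after the launch surfaces configuration problems such as invalid configuration argument, while the checked cudaDeviceSynchronize() reports errors that occur during execution; without these checks the launches fail silently, as happened above.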

The original question, "c++ - No GPU events when profiling with nvprof", can be found on Stack Overflow: https://stackoverflow.com/questions/58682741/
