
c++ - Why do I get 'insufficient buffer space' when I put my allocation code in a function?

Reposted. Author: 行者123. Updated: 2023-11-28 04:04:56

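So I've just started writing in CUDA, following the An Even Easier Introduction to CUDA guide. So far, so good. I then wanted to implement a neural network, which had me making a number of calls to cudaMallocManaged(). For readability, I decided to put these calls in a separate function called allocateStuff() (see the code below). When I then run the program with nvprof, it doesn't show the GPU time for layerInit() and instead gives the following warning: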

Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

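However, when I put the code from allocateStuff() directly in main(), the warning doesn't occur and the GPU time of layerInit() is shown. So my question is: what am I doing wrong in this function, or why else does it (apparently) overflow the buffer?

Code: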

#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (unsigned int i = index; i < firstNodes; i += stride) {
        firstLayer[i] = 0.0f;
    }
    for (unsigned int i = index; i < secondNodes; i += stride) {
        secondLayer[i] = 0.0f;
    }
    for (unsigned int i = index; i < resultNodes; i += stride) {
        resultLayer[i] = 0.0f;
    }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *firstLayer,
                   float *secondLayer,
                   float *resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
    cudaMallocManaged(&firstLayer, firstNodes * sizeof(float));
    cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
    cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

    for (auto& nodeLayer : firstWeightLayer) {
        cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
    }
    for (auto& nodeLayer : secondWeightLayer) {
        cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
    }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
    cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
    freeStuff(&t);
    freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
    for (auto& v : vec) {
        freeStuff(&v);
    }
}

int main () {
    unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
    float *firstLayer = new float[firstNodes];
    float *secondLayer = new float[secondNodes];
    float *resultLayer = new float[resultNodes];
    std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
    std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

    allocateStuff(firstNodes, secondNodes, resultNodes,
                  firstLayer, secondLayer, resultLayer,
                  firstWeightLayer, secondWeightLayer);

    layerInit<<<1,256>>>(firstNodes,
                         secondNodes,
                         resultNodes,
                         firstLayer,
                         secondLayer,
                         resultLayer);

    cudaDeviceSynchronize();
    freeStuff(firstLayer, secondLayer, resultLayer);
    freeStuff(firstWeightLayer);
    freeStuff(secondWeightLayer);

    cudaProfilerStop();
    return 0;
}

Output of nvprof ./executable with the allocateStuff() function:

==18608== NVPROF is profiling process 18608, command: ./executable
==18608== Profiling application: ./executable
==18608== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==18608== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   96.20%  105.47ms        11  9.5884ms  5.7630us  105.39ms  cudaMallocManaged
...

Output of nvprof ./executable without said function:

==18328== NVPROF is profiling process 18328, command: ./executable
==18328== Profiling application: ./executable
==18328== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.2080us         1  2.2080us  2.2080us  2.2080us  layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
      API calls:   99.50%  114.01ms        11  10.365ms  4.9390us  113.95ms  cudaMallocManaged
...

Compiler invocation: nvcc -std=c++11 -g -o executable main.cu

Best answer

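  1. Any time you are having trouble with a CUDA code, I recommend proper CUDA error checking. I suggest you implement it and check its output before asking others for help. Even if you don't understand the error output, it will be useful to others trying to help you.

    If we add the following to the end of your code, with no other changes: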

    cudaError_t err = cudaGetLastError();  // add
    if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; // add
    cudaProfilerStop();
    return 0;

    we get the following output:

    CUDA error: an illegal memory access was encountered

    What is happening with your in-function allocation implementation is that the CUDA kernel you wrote is making illegal accesses.

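  2. The main problem here is a C/C++ coding error. To pick one example, when you pass float *firstLayer to allocateStuff(), you are passing firstLayer by value. This means that any modification to the numerical value of firstLayer (i.e. the pointer value itself, which is what cudaMallocManaged is modifying) will not show up in the calling function (i.e. will not be reflected in the value of firstLayer as observed in main). This really has nothing to do with CUDA. If you passed a bare pointer to a function and then allocated that pointer using e.g. malloc(), it would be equally broken.

    Since we're looking at C++ here, we'll fix this by passing those pointers by reference rather than by value.

  3. When creating a managed allocation, it's not necessary to first allocate the pointer with new as you have shown here. Furthermore, although it is not the source of any problem here, doing so creates a memory leak in your program, so you shouldn't do it.

  4. Not sure why you are using the ampersand here: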

    freeStuff(&v);

    and here:

    freeStuff(&t);

    When you peel off the arguments to pass to cudaFree, you should pass those arguments directly, not the addresses of those arguments.
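The by-value versus by-reference point in item 2 can be reproduced without CUDA. Below is a minimal host-only sketch that uses malloc() in place of cudaMallocManaged(). The names allocateByValue, allocateByReference and the two check helpers are made up for illustration and do not appear in the original code.

```cpp
#include <cassert>
#include <cstdlib>

// Pointer passed by value: the function works on its own copy of the
// pointer, so the assignment below never reaches the caller.
void allocateByValue(float *p, std::size_t n) {
    p = static_cast<float*>(std::malloc(n * sizeof(float)));
    std::free(p);  // freed here only so that this sketch does not leak
}

// Pointer passed by reference: the assignment is visible to the caller.
void allocateByReference(float *&p, std::size_t n) {
    p = static_cast<float*>(std::malloc(n * sizeof(float)));
}

// Returns true if the by-value call left the caller's pointer untouched.
bool byValueLeavesCallerUnchanged() {
    float *p = nullptr;
    allocateByValue(p, 5);
    return p == nullptr;
}

// Returns true if the by-reference call updated the caller's pointer.
bool byReferenceUpdatesCaller() {
    float *p = nullptr;
    allocateByReference(p, 5);
    bool updated = (p != nullptr);
    std::free(p);
    return updated;
}
```

Both helpers are expected to return true when called from a small main(). The first mirrors what happens to firstLayer in the original allocateStuff(), the second mirrors the float *& signature.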

The following code has these issues addressed:

$ cat t1592.cu
#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (unsigned int i = index; i < firstNodes; i += stride) {
        firstLayer[i] = 0.0f;
    }
    for (unsigned int i = index; i < secondNodes; i += stride) {
        secondLayer[i] = 0.0f;
    }
    for (unsigned int i = index; i < resultNodes; i += stride) {
        resultLayer[i] = 0.0f;
    }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *&firstLayer,
                   float *&secondLayer,
                   float *&resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
    cudaMallocManaged(&firstLayer, firstNodes * sizeof(float));
    cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
    cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

    for (auto& nodeLayer : firstWeightLayer) {
        cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
    }
    for (auto& nodeLayer : secondWeightLayer) {
        cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
    }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
    cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
    freeStuff(t);
    freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
    for (auto& v : vec) {
        freeStuff(v);
    }
}

int main () {
    unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
    float *firstLayer; // = new float[firstNodes];
    float *secondLayer; // = new float[secondNodes];
    float *resultLayer; // = new float[resultNodes];
    std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
    std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

    allocateStuff(firstNodes, secondNodes, resultNodes,
                  firstLayer, secondLayer, resultLayer,
                  firstWeightLayer, secondWeightLayer);

    layerInit<<<1,256>>>(firstNodes,
                         secondNodes,
                         resultNodes,
                         firstLayer,
                         secondLayer,
                         resultLayer);

    cudaDeviceSynchronize();
    freeStuff(firstLayer, secondLayer, resultLayer);
    freeStuff(firstWeightLayer);
    freeStuff(secondWeightLayer);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    cudaProfilerStop();
    return 0;
}
$ nvcc -o t1592 t1592.cu
$ cuda-memcheck ./t1592
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
[user2@dc10 misc]$ nvprof ./t1592
==23751== NVPROF is profiling process 23751, command: ./t1592
==23751== Profiling application: ./t1592
==23751== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  355.63us         1  355.63us  355.63us  355.63us  layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
      API calls:   96.34%  328.78ms        11  29.889ms  7.4380us  328.69ms  cudaMallocManaged
                    1.80%  6.1272ms       388  15.791us     360ns  1.7016ms  cuDeviceGetAttribute
                    1.46%  4.9900ms         4  1.2475ms  595.29us  3.0996ms  cuDeviceTotalMem
                    0.13%  444.60us         4  111.15us  97.400us  134.37us  cuDeviceGetName
                    0.10%  356.98us         1  356.98us  356.98us  356.98us  cudaDeviceSynchronize
                    0.10%  329.51us         1  329.51us  329.51us  329.51us  cudaLaunchKernel
                    0.06%  212.66us        11  19.332us  10.066us  74.953us  cudaFree
                    0.01%  27.695us         4  6.9230us  3.6950us  12.111us  cuDeviceGetPCIBusId
                    0.00%  8.7990us         8  1.0990us     453ns  1.7600us  cuDeviceGet
                    0.00%  6.2770us         3  2.0920us     368ns  3.8460us  cuDeviceGetCount
                    0.00%  2.6700us         4     667ns     480ns     840ns  cuDeviceGetUuid
                    0.00%     528ns         1     528ns     528ns     528ns  cudaGetLastError

==23751== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1         -         -         -           -  352.0640us  Gpu page fault groups
$
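The corrected freeStuff overloads in the listing above forward each pointer unchanged. A host-only sketch of the same recursion makes that visible. It assumes a hypothetical fakeFree() that merely records what it is given in place of cudaFree(); releaseAll, g_released and releasesExactPointers are illustrative names as well.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for cudaFree: records each pointer it receives.
std::vector<void*> g_released;
void fakeFree(void *p) { g_released.push_back(p); }

// Base case: a single pointer is left.
template <typename T>
void releaseAll(T *t) {
    fakeFree(t);
}

// Recursive case: release the first pointer, then recurse on the rest.
template <typename T, typename... Args>
void releaseAll(T *t, Args... args) {
    releaseAll(t);        // pass t itself, not &t
    releaseAll(args...);
}

// Returns true if every pointer reaches fakeFree unchanged and in order.
bool releasesExactPointers() {
    float a = 0.0f, b = 0.0f;
    int c = 0;
    g_released.clear();
    releaseAll(&a, &b, &c);
    return g_released.size() == 3 &&
           g_released[0] == &a &&
           g_released[1] == &b &&
           g_released[2] == &c;
}
```

If the recursive case called releaseAll(&t) instead, the recorded value would be the address of the local parameter t rather than the allocation itself, which is the mistake item 4 points out.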

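Notes:

  1. Before running any CUDA profiler, make sure your code is free of any CUDA-reported runtime errors. The minimal error checking above, combined with the use of cuda-memcheck, is good practice.

  2. I haven't really tried to determine whether there are any potential problems with firstWeightLayer or secondWeightLayer. They don't cause any runtime errors, but depending on how you try to use them, you may run into trouble. Since there's no evidence of how you intend to use them, I'll leave it there.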
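One property of the weight-layer vectors that can be checked without CUDA is what the fill constructor does with its second argument. In standard C++, std::vector<float*>(n, new float[m]) evaluates the new expression once and copies that single pointer into all n elements, so the vector starts out holding n aliases of one buffer rather than n separate buffers. The sketch below only illustrates that behavior on the host; the helper names are invented here.

```cpp
#include <cassert>
#include <vector>

// Fill constructor: the pointer value is computed once and copied into
// every element, so all elements alias the same buffer.
bool fillConstructorSharesOneBuffer() {
    const unsigned int rows = 5, cols = 3;
    float *shared = new float[cols];
    std::vector<float*> layer(rows, shared);
    bool allSame = true;
    for (float *p : layer) {
        allSame = allSame && (p == shared);
    }
    delete[] shared;  // one allocation, so one delete[]
    return allSame;
}

// Allocating inside a loop gives each element its own buffer.
bool perElementAllocationIsDistinct() {
    const unsigned int rows = 5, cols = 3;
    std::vector<float*> layer(rows);  // elements start as null pointers
    for (auto &p : layer) {
        p = new float[cols];
    }
    bool distinct = true;
    for (unsigned int i = 0; i < rows; ++i) {
        for (unsigned int j = i + 1; j < rows; ++j) {
            distinct = distinct && (layer[i] != layer[j]);
        }
    }
    for (float *p : layer) {
        delete[] p;
    }
    return distinct;
}
```

In the listing above, each element is subsequently overwritten by cudaMallocManaged(), so the elements do end up distinct, but the buffer obtained from new is no longer reachable afterwards. That is the same kind of leak item 3 describes for the layer pointers.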

Regarding "c++ - Why do I get 'insufficient buffer space' when I put my allocation code in a function?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58902166/
