c++ - Why do I get 'insufficient buffer space' when I put allocation code in a function?

I just started out with CUDA, following the guide An Even Easier Introduction to CUDA. So far, so good. I then wanted to implement a neural network, which led me to make several calls to cudaMallocManaged(). To improve readability, I decided to put them in a separate function called allocateStuff() (see the code below). When I then run the program with nvprof, it does not show the GPU time of layerInit(), but instead gives the following warning:

Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

However, when I put the code from allocateStuff() directly in main(), the warning does not occur and the GPU time of layerInit() is shown. So my question is: what am I doing wrong in this function, or why does it (apparently) overflow the buffer?

Code:

#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (unsigned int i = index; i < firstNodes; i += stride) {
    firstLayer[i] = 0.0f;
  }
  for (unsigned int i = index; i < secondNodes; i += stride) {
    secondLayer[i] = 0.0f;
  }
  for (unsigned int i = index; i < resultNodes; i += stride) {
    resultLayer[i] = 0.0f;
  }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *firstLayer,
                   float *secondLayer,
                   float *resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
  cudaMallocManaged(&firstLayer, firstNodes * sizeof(float));
  cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
  cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

  for (auto& nodeLayer : firstWeightLayer) {
    cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
  }
  for (auto& nodeLayer : secondWeightLayer) {
    cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
  }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
  cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
  freeStuff(&t);
  freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
  for (auto& v : vec) {
    freeStuff(&v);
  }
}

int main () {
  unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
  float *firstLayer = new float[firstNodes];
  float *secondLayer = new float[secondNodes];
  float *resultLayer = new float[resultNodes];
  std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
  std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

  allocateStuff(firstNodes, secondNodes, resultNodes,
                firstLayer, secondLayer, resultLayer,
                firstWeightLayer, secondWeightLayer);

  layerInit<<<1,256>>>(firstNodes,
                       secondNodes,
                       resultNodes,
                       firstLayer,
                       secondLayer,
                       resultLayer);

  cudaDeviceSynchronize();
  freeStuff(firstLayer, secondLayer, resultLayer);
  freeStuff(firstWeightLayer);
  freeStuff(secondWeightLayer);

  cudaProfilerStop();
  return 0;
}

Output of nvprof ./executable with the allocateStuff() function:

==18608== NVPROF is profiling process 18608, command: ./executable
==18608== Profiling application: ./executable
==18608== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==18608== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 96.20% 105.47ms 11 9.5884ms 5.7630us 105.39ms cudaMallocManaged
...

Output of nvprof ./executable without said function:

==18328== NVPROF is profiling process 18328, command: ./executable
==18328== Profiling application: ./executable
==18328== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.2080us 1 2.2080us 2.2080us 2.2080us layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
API calls: 99.50% 114.01ms 11 10.365ms 4.9390us 113.95ms cudaMallocManaged
...

Compiler invocation: nvcc -std=c++11 -g -o executable main.cu

Best Answer

  1. Any time you are having trouble with a CUDA code, I recommend proper CUDA error checking. I suggest implementing it and checking the results before asking others for help. Even if you don't understand the error output, it will be useful to those trying to help you.

    If we add the following to the end of your code, with no other changes:

    cudaError_t err = cudaGetLastError();  // add
    if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; // add
    cudaProfilerStop();
    return 0;

    we get the following output:

    CUDA error: an illegal memory access was encountered

    With your allocation code placed inside the function, what is happening is that the CUDA kernel you wrote is making an illegal memory access.

  2. The main issue here is a C/C++ coding error. To pick one example: when you pass float *firstLayer to allocateStuff(), you are passing firstLayer by value. That means any modification to the numerical value of firstLayer (i.e. the pointer value itself, which is what cudaMallocManaged modifies) does not show up in the calling function, i.e. is not reflected in the value of firstLayer observed in main. This really has nothing to do with CUDA; if you passed a bare pointer to a function and then allocated it with e.g. malloc(), it would be broken in exactly the same way.

    Since this is C++, we will fix the problem by passing these pointers by reference instead of by value (see the minimal sketch after this list).

  3. When creating managed allocations, it is not necessary to first allocate the pointers with new as shown here. Furthermore, although it is not the source of any problem here, doing so creates a memory leak in your program, so you should not do it.

  4. It's not clear why you are using the ampersand here:

    freeStuff(&v);

    and here:

    freeStuff(&t);

    As you peel off arguments to pass to cudaFree, you should pass the arguments themselves, not their addresses.
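
To isolate the point made in item 2, here is a minimal host-only sketch (an addition, not part of the original answer); the helper names allocate_by_value and allocate_by_ref are hypothetical, used only for illustration:

#include <cstdlib>
#include <iostream>

// Assigns to the local copy of the pointer; the caller never sees the allocation.
void allocate_by_value(float *p, size_t n) {
  p = (float*)malloc(n * sizeof(float));  // leaks: the caller cannot free this
}

// Assigns through a reference to the caller's pointer; the caller sees the allocation.
void allocate_by_ref(float *&p, size_t n) {
  p = (float*)malloc(n * sizeof(float));
}

int main() {
  float *a = nullptr, *b = nullptr;
  allocate_by_value(a, 5);  // a is still nullptr afterwards
  allocate_by_ref(b, 5);    // b now points at the new allocation
  std::cout << "a: " << a << ", b: " << b << std::endl;
  free(b);
  return 0;
}

The same mechanics apply when the allocator is cudaMallocManaged instead of malloc, which is why the fixed code below takes float *& parameters.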

The following complete code has those issues addressed:

$ cat t1592.cu
#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (unsigned int i = index; i < firstNodes; i += stride) {
    firstLayer[i] = 0.0f;
  }
  for (unsigned int i = index; i < secondNodes; i += stride) {
    secondLayer[i] = 0.0f;
  }
  for (unsigned int i = index; i < resultNodes; i += stride) {
    resultLayer[i] = 0.0f;
  }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *&firstLayer,
                   float *&secondLayer,
                   float *&resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
  cudaMallocManaged(&firstLayer, firstNodes * sizeof(float));
  cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
  cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

  for (auto& nodeLayer : firstWeightLayer) {
    cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
  }
  for (auto& nodeLayer : secondWeightLayer) {
    cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
  }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
  cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
  freeStuff(t);
  freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
  for (auto& v : vec) {
    freeStuff(v);
  }
}

int main () {
  unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
  float *firstLayer; // = new float[firstNodes];
  float *secondLayer; // = new float[secondNodes];
  float *resultLayer; // = new float[resultNodes];
  std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
  std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

  allocateStuff(firstNodes, secondNodes, resultNodes,
                firstLayer, secondLayer, resultLayer,
                firstWeightLayer, secondWeightLayer);

  layerInit<<<1,256>>>(firstNodes,
                       secondNodes,
                       resultNodes,
                       firstLayer,
                       secondLayer,
                       resultLayer);

  cudaDeviceSynchronize();
  freeStuff(firstLayer, secondLayer, resultLayer);
  freeStuff(firstWeightLayer);
  freeStuff(secondWeightLayer);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl;
  cudaProfilerStop();
  return 0;
}
$ nvcc -o t1592 t1592.cu
$ cuda-memcheck ./t1592
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
[user2@dc10 misc]$ nvprof ./t1592
==23751== NVPROF is profiling process 23751, command: ./t1592
==23751== Profiling application: ./t1592
==23751== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 355.63us 1 355.63us 355.63us 355.63us layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
API calls: 96.34% 328.78ms 11 29.889ms 7.4380us 328.69ms cudaMallocManaged
1.80% 6.1272ms 388 15.791us 360ns 1.7016ms cuDeviceGetAttribute
1.46% 4.9900ms 4 1.2475ms 595.29us 3.0996ms cuDeviceTotalMem
0.13% 444.60us 4 111.15us 97.400us 134.37us cuDeviceGetName
0.10% 356.98us 1 356.98us 356.98us 356.98us cudaDeviceSynchronize
0.10% 329.51us 1 329.51us 329.51us 329.51us cudaLaunchKernel
0.06% 212.66us 11 19.332us 10.066us 74.953us cudaFree
0.01% 27.695us 4 6.9230us 3.6950us 12.111us cuDeviceGetPCIBusId
0.00% 8.7990us 8 1.0990us 453ns 1.7600us cuDeviceGet
0.00% 6.2770us 3 2.0920us 368ns 3.8460us cuDeviceGetCount
0.00% 2.6700us 4 667ns 480ns 840ns cuDeviceGetUuid
0.00% 528ns 1 528ns 528ns 528ns cudaGetLastError

==23751== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
1 - - - - 352.0640us Gpu page fault groups
$

Notes:

  1. Before running any CUDA profiler, make sure your code runs without any CUDA-reported runtime errors. The minimal error checking above, combined with the use of cuda-memcheck, is good practice (a reusable error-checking macro is sketched after these notes).

  2. I haven't really tried to determine whether there are any lurking problems with firstWeightLayer or secondWeightLayer. They don't cause any runtime errors, but depending on how you intend to use them, you may run into trouble. Since there is no indication of how you intend to use them, I'll stop there.
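
As a supplement to note 1 (an addition, not part of the original answer), here is a minimal sketch of one common way to wrap CUDA runtime calls in an error check; the macro name cudaCheck is a hypothetical choice for illustration, and the code assumes compilation with nvcc:

#include <cstdio>
#include <cstdlib>

// Minimal error-checking macro sketch: print the error and abort on failure.
#define cudaCheck(call)                                            \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

// Usage:
//   cudaCheck(cudaMallocManaged(&firstLayer, firstNodes * sizeof(float)));
//   layerInit<<<1,256>>>(...);
//   cudaCheck(cudaGetLastError());       // catches kernel launch errors
//   cudaCheck(cudaDeviceSynchronize());  // catches asynchronous kernel errors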

For c++ - Why do I get 'insufficient buffer space' when I put allocation code in a function?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58902166/
