
parallel-processing - Calling a __host__ function from a __global__ function is not allowed when using cudaMallocManaged


I have a piece of pre-written code that I am trying to modify to use CUDA, and I am running into a lot of trouble. At the moment, I am trying to make the functions that I want to become kernel functions void, but I am getting a number of errors.

Here is the list of errors I receive:

black_scholes.cu(54): error: calling a __host__ function("cudaMallocManaged<double> ") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(54): error: identifier "cudaMallocManaged<double> " is undefined in device code

black_scholes.cu(56): error: calling a __host__ function("init_gaussrand_state") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(56): error: identifier "init_gaussrand_state" is undefined in device code

black_scholes.cu(65): error: calling a __host__ function("spawn_prng_stream") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(65): error: identifier "spawn_prng_stream" is undefined in device code

black_scholes.cu(66): error: calling a __host__ function("gaussrand1") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(66): error: identifier "gaussrand1" is undefined in device code

black_scholes.cu(66): error: identifier "uniform_random_double" is undefined in device code

black_scholes.cu(73): error: calling a __host__ function("free_prng_stream") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(73): error: identifier "free_prng_stream" is undefined in device code

black_scholes.cu(74): error: calling a __host__ function("cudaFree") from a __global__ function("black_scholes_iterate") is not allowed

black_scholes.cu(74): error: identifier "cudaFree" is undefined in device code

I am posting the first two errors in particular because, when learning CUDA through the NVIDIA introductory class, it was common to call cudaMallocManaged in a __global__ function, and I don't see what is different here.

Here is my .cu code:

#include "black_scholes.h"
#include "gaussian.h"
#include "random.h"
#include "util.h"
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

__managed__ double stddev;

__global__ void black_scholes_stddev (void* the_args)
{
    black_scholes_args_t* args = (black_scholes_args_t*) the_args;
    const double mean = args->mean;
    const int M = args->M;
    double variance = 0.0;
    int k = blockIdx.x * blockDim.x + threadIdx.x;

    if (k < M)
    {
        const double diff = args->trials[k] - mean;
        variance += diff * diff / (double) M;
    }

    args->variance = variance;
    stddev = sqrt (variance);
}


__global__ void black_scholes_iterate (void* the_args)
{
    black_scholes_args_t* args = (black_scholes_args_t*) the_args;

    const int S = args->S;
    const int E = args->E;
    const int M = args->M;
    const double r = args->r;
    const double sigma = args->sigma;
    const double T = args->T;

    double* trials = args->trials;
    double mean = 0.0;

    gaussrand_state_t gaussrand_state;
    void* prng_stream = NULL;

    double* randnumbs;
    cudaMallocManaged (&randnumbs, M * sizeof (double));

    init_gaussrand_state (&gaussrand_state);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.x * blockDim.x + threadIdx.x;

    //for (int i = 0; i < M; i++)
    if (i < M)
    {
        prng_stream = spawn_prng_stream (i % 4);
        const double gaussian_random_number = gaussrand1 (&uniform_random_double, prng_stream, &gaussrand_state);
        randnumbs[i] = gaussian_random_number;
        const double current_value = S * exp ((r - (sigma * sigma) / 2.0) * T + sigma * sqrt (T) * randnumbs[k]);
        trials[k] = exp (-r * T) * ((current_value - E < 0.0) ? 0.0 : current_value - E);
        mean += trials[k] / (double) M; // needs to be shared
        args->mean = mean;
    }
    free_prng_stream (prng_stream);
    cudaFree (randnumbs);
}



void black_scholes (confidence_interval_t* interval,
                    const double S,
                    const double E,
                    const double r,
                    const double sigma,
                    const double T,
                    const int M,
                    const int n)
{
    black_scholes_args_t args;
    double mean = 0.0;
    double conf_width = 0.0;
    double* trials = NULL;

    assert (M > 0);
    trials = (double*) malloc (M * sizeof (double));
    assert (trials != NULL);

    args.S = S;
    args.E = E;
    args.r = r;
    args.sigma = sigma;
    args.T = T;
    args.M = M;
    args.trials = trials;
    args.mean = 0.0;
    args.variance = 0.0;

    (void) black_scholes_iterate<<<1,1>>> (&args);
    mean = args.mean;
    black_scholes_stddev<<<1,1>>> (&args);
    cudaDeviceSynchronize ();

    conf_width = 1.96 * stddev / sqrt ((double) M);
    interval->min = mean - conf_width;
    interval->max = mean + conf_width;

    deinit_black_scholes_args (&args);
}


void deinit_black_scholes_args (black_scholes_args_t* args)
{
    if (args != NULL)
        if (args->trials != NULL)
        {
            free (args->trials);
            args->trials = NULL;
        }
}

Any help understanding what is happening would be greatly appreciated, as this seems to be a recurring theme.

Best Answer

It is not currently possible to call cudaMallocManaged in CUDA device code. It simply cannot be done. I am not aware of any NVIDIA training material that demonstrates using cudaMallocManaged in device code.
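For reference, a minimal sketch of the usual pattern: cudaMallocManaged is called from host code, and only the resulting managed pointer is used inside the kernel. The kernel, sizes, and values below are illustrative, not taken from the question's codebase.

#include <cstdio>

__global__ void scale (double* data, int n, double factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // device code only dereferences the managed pointer
}

int main ()
{
    const int N = 1024;
    double* data;
    cudaMallocManaged (&data, N * sizeof (double));   // allocation happens on the host

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    scale<<<(N + 255) / 256, 256>>> (data, N, 2.0);
    cudaDeviceSynchronize ();   // wait before touching managed memory on the host again

    printf ("data[0] = %f\n", data[0]);
    cudaFree (data);            // freeing also happens on the host
    return 0;
}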

If you wish to perform allocations in the kernel, I suggest using the methods described in the programming guide. In addition, in-kernel new and delete work similarly to in-kernel malloc() and free(), as sketched below.
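A minimal sketch of in-kernel allocation along those lines: device-side malloc()/free() draw from the device heap rather than managed memory. The heap-size value and the per-thread element count here are illustrative assumptions.

#include <cstdio>

__global__ void per_thread_alloc (int elems_per_thread)
{
    // Device-side malloc() allocates from the device heap, not managed memory.
    double* scratch = (double*) malloc (elems_per_thread * sizeof (double));
    if (scratch == NULL)
        return;   // device malloc can fail, so check the result

    for (int i = 0; i < elems_per_thread; i++)
        scratch[i] = (double) i;

    free (scratch);   // must be freed in device code as well
}

int main ()
{
    // Enlarge the device heap before launch if the default (8 MB) is too small.
    cudaDeviceSetLimit (cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    per_thread_alloc<<<4, 64>>> (16);
    cudaDeviceSynchronize ();
    return 0;
}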

Regarding parallel-processing - Calling a __host__ function from a __global__ function is not allowed when using cudaMallocManaged, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67521866/
