c++ - Using extern in Halide with GPU

Reposted by: 行者123. Updated: 2023-11-30 02:43:19

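I am trying to use an extern function in Halide, and in my context I want it to run on the GPU.

I compile ahead of time (AOT) with the opencl target. Since OpenCL can also run on the CPU, I select the GPU like this: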

halide_set_ocl_device_type("gpu");

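For now, everything is scheduled with compute_root().

First question: if I use compute_root() with the OpenCL GPU target, will my pipeline be computed on the device, with some CopyHtoD and DtoH transfers, or will it run on host buffers?

Second question, more specific to extern functions. We use some extern calls because some of our algorithms are not written in Halide. The extern declaration: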

foo.define_extern("cool_foo", args, Float(32), 4);

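The extern implementation:

extern "C" int cool_foo(buffer_t * in, int w, int h, int z, buffer_t * out){ .. }

However, inside cool_foo my buffer_t is only loaded in host memory. The dev address is 0 (the default).

If I try to copy the memory before the algorithm: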

halide_copy_to_dev(NULL, &in);

it does nothing.

If I provide only device memory:

in.host = NULL;

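My host pointer is null, but the device address is still 0.

(In my case dev_dirty is true and host_dirty is false.)

Any ideas?

Edit (in reply to dsharlet)

Here is the structure of my code:

Parse the data correctly on the CPU. --> Send the buffer to the GPU (using halide_copy_to_dev...) --> Enter the Halide pipeline, read the parameters and add the boundary condition --> Enter my extern function --> ...

I do not have a valid buffer_t inside my extern function. I schedule everything with compute_root(), but I use HL_TARGET=host-opencl and set the OCL device type to gpu. Before entering Halide I can read my device address, and it is fine.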

Here is my code:

Before Halide, everything is CPU data (pointers), and we transfer it to the GPU:

buffer_t k = { 0, (uint8_t *) k_full, {w_k, h_k, num_patch_x * num_patch_y * 3}, {1, w_k, w_k * h_k}, {0}, sizeof(float), };
#if defined( USEGPU )
    // Transfer into GPU
    halide_copy_to_dev(NULL, &k);
    k.host_dirty = false;
    k.dev_dirty = true;
    //k.host = NULL; // It's k_full
#endif
halide_func(&k);

Inside Halide:

ImageParam ...
Func process;
process = halide_sub_func(k, width, height, k.channels());
process.compute_root();

...

Func halide_sub_func(ImageParam k, Expr width, Expr height, Expr patches)
{
    Func kBounded("kBounded"), kShifted("kShifted"), khat("khat"), khat_tuple("khat_tuple");
    kBounded = repeat_image(constant_exterior(k, 0.0f), 0, width, 0, height, 0, patches);
    kShifted(x, y, pi) = kBounded(x + k.width() / 2, y + k.height() / 2, pi);

    khat = extern_func(kShifted, width, height, patches);
    khat_tuple(x, y, pi) = Tuple(khat(0, x, y, pi), khat(1, x, y, pi));

    kShifted.compute_root();
    khat.compute_root();

    return khat_tuple;
}

Outside Halide (the extern function):

inline .... 
{
   //The buffer_t.dev and .host are 0 and null. I expect a null from the host, but the dev..
}

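Best answer

I found the solution to my problem.

I am posting the answer here with code. (I did a bit of offline testing, so the variable names do not match.)

Inside Halide: (Halide_func.cpp)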

#include <Halide.h>

#include <cstdio>   // printf
#include <cstdlib>  // atoi
#include <vector>

 using namespace Halide;
 using namespace Halide::BoundaryConditions;

 Func thirdPartyFunction(ImageParam f);
 Func fourthPartyFunction(ImageParam f);
 Var x, y;

 int main(int argc, char **argv) {
    // Input:
    ImageParam f( Float( 32 ), 2, "f" );

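    printf(" Argument: %d\n",argc);

    // Fall back to the extern test (the final else branch) when no argument is given.
    int test = (argc > 1) ? atoi(argv[1]) : 4;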

    if (test == 1) {
        Func f1;
        f1(x, y) = f(x, y) + 1.0f;
        f1.gpu_tile(x, 256);
        std::vector<Argument> args( 1 );
        args[ 0 ] = f;
        f1.compile_to_file("halide_func", args);

    } else if (test == 2) {
        Func fOutput("fOutput");
        Func fBounded("fBounded");
        fBounded = repeat_image(f, 0, f.width(), 0, f.height());
        fOutput(x, y) = fBounded(x-1, y) + 1.0f;


        fOutput.gpu_tile(x, 256);
        std::vector<Argument> args( 1 );
        args[ 0 ] = f;
        fOutput.compile_to_file("halide_func", args);

    } else if (test == 3) {
        Func h("hOut");

        h = thirdPartyFunction(f);

        h.gpu_tile(x, 256);
        std::vector<Argument> args( 1 );
        args[ 0 ] = f;
        h.compile_to_file("halide_func", args);

    } else {
        Func h("hOut");

        h = fourthPartyFunction(f);

        std::vector<Argument> args( 1 );
        args[ 0 ] = f;
        h.compile_to_file("halide_func", args);
    }
 }

 Func thirdPartyFunction(ImageParam f) {
     Func g("g");
     Func fBounded("fBounded");
     Func h("h");
     //Boundary
     fBounded = repeat_image(f, 0, f.width(), 0, f.height());
     g(x, y) = fBounded(x-1, y) + 1.0f;
     h(x, y) = g(x, y) - 1.0f;

     // These must stay commented out if you want to use the GPU schedule.
     //g.compute_root(); // At least one stage scheduled alone
     //h.compute_root();

     return h;
 }

Func fourthPartyFunction(ImageParam f) {
    Func fBounded("fBounded");
    Func g("g");
    Func h("h");

    //Boundary
    fBounded = repeat_image(f, 0, f.width(), 0, f.height());

    // Preprocess
    g(x, y) = fBounded(x-1, y) + 1.0f;

    g.compute_root();
    g.gpu_tile(x, y, 256, 1);


    // Extern
    std::vector < ExternFuncArgument > args = { g, f.width(), f.height() };
    h.define_extern("extern_func", args, Int(16), 3);

    h.compute_root();
    return h;
}
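
To make the boundary condition used in these functions concrete: repeat_image(f, 0, f.width(), 0, f.height()) tiles the image across the whole coordinate space, so a read at x-1 with x = 0 wraps around to the last column. The following is a plain C++ sketch of that index wrapping, not Halide code. The helper names wrap_index and shifted_plus_one are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: map any integer coordinate into [min, min + extent),
// the way a tiled ("repeat") boundary condition does.
int wrap_index(int x, int min, int extent) {
    assert(extent > 0);
    int r = (x - min) % extent;  // may be negative in C++
    if (r < 0) r += extent;      // force a non-negative remainder
    return r + min;
}

// Plain-CPU analogue of g(x, y) = fBounded(x - 1, y) + 1.0f on a dense,
// row-major image of the given width and height.
float shifted_plus_one(const std::vector<float> &img, int width, int height,
                       int x, int y) {
    int xs = wrap_index(x - 1, 0, width);
    int ys = wrap_index(y, 0, height);
    return img[xs + ys * width] + 1.0f;
}
```

For a 256-pixel-wide image, wrap_index(-1, 0, 256) gives 255, which is the column that g(0, y) reads from.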

Extern function: (external_func.h)

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <cinttypes>
#include <cstring>
#include <fstream>
#include <map>
#include <vector>
#include <complex>
#include <chrono>
#include <iostream>


#include <clFFT.h> // Pulls in all the OpenCL headers I need.

using namespace std;
// Useful stuff.
void completeDetails2D(buffer_t buffer) {
    // Read all elements:
    std::cout << "Buffer information:" << std::endl;
    std::cout << "Extent: " << buffer.extent[0] << ", " << buffer.extent[1] << std::endl;
    std::cout << "Stride: " << buffer.stride[0] << ", " << buffer.stride[1] << std::endl;
    std::cout << "Min: " << buffer.min[0] << ", " << buffer.min[1] << std::endl;
    std::cout << "Elem size: " << buffer.elem_size << std::endl;
    std::cout << "Host dirty: " << buffer.host_dirty << ", Dev dirty: " << buffer.dev_dirty << std::endl;
    printf("Host pointer: %p, Dev pointer: %" PRIu64 "\n\n\n", buffer.host, buffer.dev);
}

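// Halide's internal OpenCL context and command queue, referenced by their
// mangled names (Halide::Runtime::Internal::weak_cl_ctx and weak_cl_q).
extern cl_context _ZN6Halide7Runtime8Internal11weak_cl_ctxE;
extern cl_command_queue _ZN6Halide7Runtime8Internal9weak_cl_qE;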


extern "C" int extern_func(buffer_t * in, int width, int height, buffer_t * out)
{
    printf("In extern\n");
    completeDetails2D(*in);
    printf("Out extern\n");
    completeDetails2D(*out);

    if(in->dev == 0) {
        // Boundary stuff
        in->min[0] = 0;
        in->min[1] = 0;
        in->extent[0] = width;
        in->extent[1] = height;
        return 0;
    }

    // Super awesome stuff on GPU
    // ...

    cl_context & ctx = _ZN6Halide7Runtime8Internal11weak_cl_ctxE; // Found by zougloub
    cl_command_queue & queue = _ZN6Halide7Runtime8Internal9weak_cl_qE; // Same

    printf("ctx: %p\n", ctx);

    printf("queue: %p\n", queue);

    cl_mem buffer_in;
    buffer_in = (cl_mem) in->dev;
    cl_mem buffer_out;
    buffer_out = (cl_mem) out->dev;

    // Just copying data from one buffer to another
    int err = clEnqueueCopyBuffer(queue, buffer_in, buffer_out, 0, 0, 256*256*4, 0, NULL, NULL);

    printf("copy: %d\n", err);

    err = clFinish(queue);

    printf("finish: %d\n\n", err);

    return 0;
}
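
The copy in extern_func hard-codes 256*256*4 bytes, which only matches the 256 x 256 float test image. As a sketch under assumptions, the size could instead be derived from the buffer's own metadata. The struct below is a stand-in that mirrors only the buffer_t fields used in this post (the real definition comes from HalideRuntime.h), and dense_size_in_bytes is a hypothetical helper that assumes a densely packed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the legacy Halide buffer_t, reduced to the fields this post
// uses. The real definition comes from HalideRuntime.h.
struct mock_buffer_t {
    uint64_t dev;
    uint8_t *host;
    int32_t extent[4];
    int32_t stride[4];
    int32_t min[4];
    int32_t elem_size;
    bool host_dirty;
    bool dev_dirty;
};

// Hypothetical helper: number of bytes in a densely packed buffer.
// Dimensions with extent 0 are treated as unused.
size_t dense_size_in_bytes(const mock_buffer_t &b) {
    assert(b.elem_size > 0);
    size_t elems = 1;
    for (int d = 0; d < 4; ++d) {
        if (b.extent[d] > 0) elems *= static_cast<size_t>(b.extent[d]);
    }
    return elems * static_cast<size_t>(b.elem_size);
}
```

In extern_func, a value computed this way could replace the literal passed to clEnqueueCopyBuffer, provided both buffers are dense and have the same size.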

Finally, outside Halide: (Halide_test.cpp)

#include <halide_func.h>
#include <iostream>
#include <cinttypes>

#include <external_func.h>

// Extern function available inside the .o generated.
#include "HalideRuntime.h"

int main(int argc, char **argv) {

    // Init the kernel in GPU
    halide_set_ocl_device_type("gpu");

    // Create a buffer
    int width = 256;
    int height = 256;
    float * bufferHostIn = (float*) malloc(sizeof(float) * width * height);
    float * bufferHostOut = (float*) malloc(sizeof(float) * width * height);

    for( int j = 0; j < height; ++j) {
        for( int i = 0; i < width; ++i) {
            bufferHostIn[i + j * width] = i+j;
        }
    }

    buffer_t bufferHalideIn = {0, (uint8_t *) bufferHostIn, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};
    buffer_t bufferHalideOut = {0, (uint8_t *) bufferHostOut, {width, height}, {1, width, width * height}, {0, 0}, sizeof(float), true, false};

    printf("IN\n");
    completeDetails2D(bufferHalideIn);
    printf("Data (host): ");
    for(int i = 0; i < 10; ++ i) {
        printf(" %f, ", bufferHostIn[i]);
    }
    printf("\n");

    printf("OUT\n");
    completeDetails2D(bufferHalideOut);

    // Send to GPU
    halide_copy_to_dev(NULL, &bufferHalideIn);
    halide_copy_to_dev(NULL, &bufferHalideOut);
    bufferHalideIn.host_dirty = false;
    bufferHalideIn.dev_dirty = true;
    bufferHalideOut.host_dirty = false;
    bufferHalideOut.dev_dirty = true;
    // Trick Halide into using the device buffer by hiding the host pointer.
    bufferHalideIn.host = NULL;
    bufferHalideOut.host = NULL;

    printf("IN After device\n");
    completeDetails2D(bufferHalideIn);

    // Halide function
    halide_func(&bufferHalideIn, &bufferHalideOut);

    // Get back to HOST
    bufferHalideIn.host = (uint8_t*)bufferHostIn;
    bufferHalideOut.host = (uint8_t*)bufferHostOut;
    halide_copy_to_host(NULL, &bufferHalideOut);
    halide_copy_to_host(NULL, &bufferHalideIn);

    // Validation
    printf("\nOUT\n");
    completeDetails2D(bufferHalideOut);
    printf("Data (host): ");
    for(int i = 0; i < 10; ++ i) {
        printf(" %f, ", bufferHostOut[i]);
    }
    printf("\n");

    // Free all
    free(bufferHostIn);
    free(bufferHostOut);

}
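
The buffer_t initialisers in this test program describe a dense row-major layout: a stride of {1, width} means element (i, j) lives at i + j * width, which is exactly how the fill loop indexes bufferHostIn. A minimal sketch of the general offset rule, including a non-zero min, is given here. The name element_offset is a hypothetical helper, not part of the Halide runtime.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: offset (in elements, not bytes) of coordinate (x, y)
// in a 2-D buffer described by per-dimension min and stride.
int64_t element_offset(const int32_t min[2], const int32_t stride[2],
                       int x, int y) {
    assert(stride[0] != 0 && stride[1] != 0);
    return static_cast<int64_t>(x - min[0]) * stride[0] +
           static_cast<int64_t>(y - min[1]) * stride[1];
}
```

Multiplying the result by elem_size gives the byte offset from the host pointer.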

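You can compile halide_func with test 4 to exercise all of the extern functionality.

Here are some of my conclusions. (Thanks to Zalman and zougloub.)

compute_root() on its own does not invoke the device.
We need gpu() or gpu_tile() in the schedule to invoke the GPU routines. (By the way, you need to pass all of your variables to it.)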
Using a gpu_tile size less than your item count would crash your program.
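Boundary conditions work well on the GPU.
Before calling the extern function, the Func that goes in as input needs to be scheduled as: f.compute_root(); f.gpu_tile(x,y,...,...); compute_root() is not implicit for intermediate stages.
If the dev address is 0, that is normal: we send back the dimensions and the extern will be called again.
The last stage has an implicit compute_root().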
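
The point about a zero dev address reflects the two-phase way the extern stage is called: first with empty buffers, to ask which input region it needs, and then again with real ones. The sketch below imitates that protocol in plain C++ under stated assumptions. The struct query_buffer is a reduced stand-in for buffer_t, and mock_extern, call_twice and the device handles 1 and 2 are all made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for buffer_t: only the fields the bounds query touches.
struct query_buffer {
    uint64_t dev;
    uint8_t *host;
    int32_t min[4];
    int32_t extent[4];
};

// Hypothetical extern stage. Returns 0 on success, like extern_func.
// work_calls counts how often the "real" branch ran.
int mock_extern(query_buffer *in, int width, int height, query_buffer *out,
                int *work_calls) {
    if (in->host == nullptr && in->dev == 0) {
        // Bounds query: report the input region this stage needs, then return.
        in->min[0] = 0;
        in->min[1] = 0;
        in->extent[0] = width;
        in->extent[1] = height;
        return 0;
    }
    (void)out;  // a real stage would write its result through out->dev
    ++*work_calls;
    return 0;
}

// Imitates the caller: query first, then call again with allocated buffers.
int call_twice(int width, int height) {
    query_buffer in = {};
    query_buffer out = {};
    int work_calls = 0;
    mock_extern(&in, width, height, &out, &work_calls);  // bounds query
    assert(in.extent[0] == width && in.extent[1] == height);
    in.dev = 1;   // pretend device memory was allocated (made-up handles)
    out.dev = 2;
    mock_extern(&in, width, height, &out, &work_calls);  // real call
    return work_calls;
}
```

Only the second call reaches the work branch, so seeing dev == 0 on the first call is not a sign that the upload failed.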

Regarding c++ - using extern in Halide with GPU, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26265484/
