c++ - Halide JIT compilation

Reposted by: 行者123  Updated: 2023-11-30 03:39:01

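I am trying to JIT-compile my Halide program so that I can run it later, several times, on different images. I think I am doing something wrong, though; can anyone correct me? First, I create the Halide function I want to run: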

void m_gammaFunctionTMOGenerate()
{
    Halide::ImageParam img(Halide::type_of<float>(), 3);
    img.set_stride(0, 4);
    img.set_stride(2, 1);
    Halide::Var x, y, c;
    Halide::Param<float> key, sat, clampMax, clampMin;
    Halide::Param<bool> cS;
    Halide::Func gamma;
    // algorithm
    //img.width() , img.height();
    if (cS.get())
    {
        float k1 = 1.6774;
        float k2 = 0.9925;
        sat.set((1 + k1) * pow(key.get(), k2) / (1 + k1 * pow(key.get(), k2)));
    }
    Halide::Expr luminance = img(x, y, 0) * 0.072186f + img(x, y, 1) * 0.715158f + img(x, y, 2) *  0.212656f;
    Halide::Expr ldr_lum = (luminance - clampMin) / (clampMax - clampMin);
    Halide::clamp(ldr_lum, 0.f, 1.f);
    ldr_lum = Halide::pow(ldr_lum, key);
    Halide::Expr imLum = img(x, y, c) / luminance;
    imLum = Halide::pow(imLum, sat) * ldr_lum;
    Halide::clamp(imLum, 0.f, 1.f);
    gamma(x, y, c) = imLum;
    // schedule
    gamma.vectorize(x, 16).parallel(y);

    // compilation
    auto & obuff = gamma.output_buffer();
    obuff.set_stride(0, 4);
    obuff.set_stride(2, 1);
    obuff.set_extent(2, 3);
    std::vector<Halide::Argument> arguments = { img, key, sat, clampMax, clampMin, cS };
    m_gammaFunction = (gammafunction)(gamma.compile_jit());

}

I store it in a function pointer:

typedef int(*gammafunction)(buffer_t*, float, float, float, float, bool, buffer_t*);
gammafunction m_gammaFunction;

Then I try to run it:

buffer_t  output_buf = { 0 };
//// The host pointers point to the start of the image data:
buffer_t buf = { 0 };
buf.host = (uint8_t *)data; // Might also need const_cast
float * output = new float[width * height * 4];
output_buf.host = (uint8_t*)(output);
//                                // If the buffer doesn't start at (0, 0), then assign mins
output_buf.extent[0] = buf.extent[0] = width; // In elements, not bytes
output_buf.extent[1] = buf.extent[1] = height; // In elements, not bytes
output_buf.extent[2] = buf.extent[2] = 4;    // Assuming RGBA
//                 // No need to assign additional extents as they were init'ed to zero above
output_buf.stride[0] = buf.stride[0] = 4; // RGBA interleaved
output_buf.stride[1] = buf.stride[1] = width * 4; // Assuming no line padding
output_buf.stride[2] = buf.stride[2] = 1; // Channel interleaved
output_buf.elem_size = buf.elem_size = sizeof(float);

// Run the pipeline
int error = m_photoFunction(&buf, params[0], &output_buf);

But it does not work. The error:

Exception thrown at 0x000002974F552DE0 in Viewer.exe: 0xC0000005: Access violation executing location 0x000002974F552DE0.

If there is a handler for this exception, the program may be safely continued.

Edit:

Here is the code where I run the function:

buffer_t  output_buf = { 0 };
//// The host pointers point to the start of the image data:
buffer_t buf = { 0 };
buf.host = (uint8_t *)data; // Might also need const_cast
float * output = new float[width * height * 4];
output_buf.host = (uint8_t*)(output);
//                                // If the buffer doesn't start at (0, 0), then assign mins
output_buf.extent[0] = buf.extent[0] = width; // In elements, not bytes
output_buf.extent[1] = buf.extent[1] = height; // In elements, not bytes
output_buf.extent[2] = buf.extent[2] = 3;    // Assuming RGBA
                                             //                // No need to assign additional extents as they were init'ed to zero above
output_buf.stride[0] = buf.stride[0] = 4; // RGBA interleaved
output_buf.stride[1] = buf.stride[1] = width * 4; // Assuming no line padding
output_buf.stride[2] = buf.stride[2] = 1; // Channel interleaved
output_buf.elem_size = buf.elem_size = sizeof(float);

// Run the pipeline
int error = m_gammaFunction(&buf, params[0], params[1], params[2], params[3], params[4] > 0.5 ? true : false, &output_buf);

if (error) {
    printf("Halide returned an error: %d\n", error);
    return -1;
}

memcpy(output, data, size * sizeof(float));

Can anyone help me?

Edit:

Thanks to @KhouriGiordano, I found out what I was doing wrong. I had indeed switched to this code from AOT compilation. My code now looks like this:

class GammaOperator
{
public:
    GammaOperator();

    int realize(buffer_t * input, float params[], buffer_t * output, int width);
private:

    HalideFloat m_key;
    HalideFloat m_sat;
    HalideFloat m_clampMax;
    HalideFloat m_clampMin;
    HalideBool  m_cS;

    Halide::ImageParam m_img;
    Halide::Var x, y, c;
    Halide::Func m_gamma;
};


GammaOperator::GammaOperator()
    : m_img( Halide::type_of<float>(), 3)
{

    Halide::Expr w = (1.f + 1.6774f) * pow(m_key.get(), 0.9925f) / (1.f + 1.6774f * pow(m_key.get(), 0.9925f));
    Halide::Expr sat = Halide::select(m_cS, m_sat, w);

    Halide::Expr luminance = m_img(x, y, 0) * 0.072186f + m_img(x, y, 1) * 0.715158f + m_img(x, y, 2) *  0.212656f;
    Halide::Expr ldr_lum = (luminance - m_clampMin) / (m_clampMax - m_clampMin);
    ldr_lum = Halide::clamp(ldr_lum, 0.f, 1.f);
    ldr_lum = Halide::pow(ldr_lum, m_key);
    Halide::Expr imLum = m_img(x, y, c) / luminance;
    imLum = Halide::pow(imLum, sat) * ldr_lum;
    imLum = Halide::clamp(imLum, 0.f, 1.f);
    m_gamma(x, y, c) = imLum;

}

int GammaOperator::realize(buffer_t * input, float params[], buffer_t * output, int width)
{
    m_img.set(Halide::Buffer(Halide::type_of<float>(), input));
    m_img.set_stride(0, 4);
    m_img.set_stride(1, width * 4);
    m_img.set_stride(2, 4);
    // algorithm
    m_gamma.vectorize(x, 16).parallel(y);

    //params[0], params[1], params[2], params[3], params[4] > 0.5 ? true : false
    //{ img, key, sat, clampMax, clampMin, cS };
    m_key.set(params[0]);
    m_sat.set(params[1]);
    m_clampMax.set(params[2]);
    m_clampMin.set(params[3]);
    m_cS.set(params[4] > 0.5f ? true : false);
    //// compilation
    m_gamma.realize(Halide::Buffer(Halide::type_of<float>(), output));
    return 0;
}

This is how I use it:

buffer_t  output_buf = { 0 };
    //// The host pointers point to the start of the image data:
    buffer_t buf = { 0 };
    buf.host = (uint8_t *)data; // Might also need const_cast
    float * output = new float[width * height * 4];
    output_buf.host = (uint8_t*)(output);
    //                                // If the buffer doesn't start at (0, 0), then assign mins
    output_buf.extent[0] = buf.extent[0] = width; // In elements, not bytes
    output_buf.extent[1] = buf.extent[1] = height; // In elements, not bytes
    output_buf.extent[2] = buf.extent[2] = 4;    // Assuming RGBA
                                                 //                // No need to assign additional extents as they were init'ed to zero above
    output_buf.stride[0] = buf.stride[0] = 4; // RGBA interleaved
    output_buf.stride[1] = buf.stride[1] = width * 4; // Assuming no line padding
    output_buf.stride[2] = buf.stride[2] = 1; // Channel interleaved
    output_buf.elem_size = buf.elem_size = sizeof(float);

    // Run the pipeline

    int error = s_gamma->realize(&buf, params, &output_buf, width);

But it still crashes in the m_gamma.realize call, with this message in the console:

Error: Constraint violated: f0.stride.0 (4) == 1 (1)
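
The message reports that a buffer named f0 has a stride of 4 in dimension 0 where the pipeline expects 1. Halide assumes a dense innermost dimension unless the constraint is relaxed. The first version of the code did that for the output with gamma.output_buffer().set_stride(0, 4), and the class version has no equivalent call, which may be the cause. As a minimal sketch of what those stride values mean, here is the offset arithmetic in plain C++. The Layout struct and both helper functions are hypothetical names for illustration and are not part of Halide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a Halide type). Strides are in elements, not bytes.
struct Layout {
    int stride_x;
    int stride_y;
    int stride_c;

    // Offset, in elements, of channel c of pixel (x, y) from the buffer start.
    std::size_t offset(int x, int y, int c) const {
        return static_cast<std::size_t>(x) * stride_x
             + static_cast<std::size_t>(y) * stride_y
             + static_cast<std::size_t>(c) * stride_c;
    }
};

// Interleaved RGBA: the channels of one pixel are adjacent (stride 1 in c),
// so moving one pixel in x skips 4 elements. These are the strides used in
// the question's buffer_t setup.
inline Layout interleaved_rgba(int width) {
    return {4, width * 4, 1};
}

// Planar: x is the dense dimension (stride 1) and every channel is a
// separate width * height plane. This is the layout with stride 1 in
// dimension 0 that the constraint in the error message describes.
inline Layout planar(int width, int height) {
    return {1, width, width * height};
}
```

With a width of 10, the interleaved layout puts pixel (1, 0) channel 0 at element 4, while the planar layout puts it at element 1. A pipeline compiled for one layout and handed the other would read the wrong elements, which is what the constraint check guards against.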

Best answer

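By calling Halide::Param::get(), you extract the value held by the Param object at the moment get() is called (the default, 0). If you want to use the parameter value supplied when the generated function is invoked, use the Param directly without calling get(); it should be implicitly converted to an Expr.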
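
The difference between reading a parameter while the pipeline is being built and reading it when the pipeline runs can be sketched without Halide. The ToyParam type and the two build functions below are a toy model invented for this illustration. They are not Halide's Param class and only mimic the timing issue described above.

```cpp
#include <cassert>
#include <functional>

// Toy stand-in for a runtime parameter. NOT Halide's Param class.
struct ToyParam {
    float value = 0.0f;  // default value, like a parameter that was never set
    float get() const { return value; }
    void set(float v) { value = v; }
};

// Eager: reads the parameter while the "pipeline" is being built, so the
// value at that moment (here the default 0) is baked in as a constant.
inline std::function<float(float)> build_eager(const ToyParam& p) {
    const float baked = p.get();
    return [baked](float x) { return x * baked; };
}

// Deferred: keeps a reference and reads the parameter each time the
// "pipeline" runs, so later calls to set() are observed.
// The ToyParam must outlive the returned function.
inline std::function<float(float)> build_deferred(const ToyParam& p) {
    return [&p](float x) { return x * p.get(); };
}
```

If both functions are built first and the parameter is set to 2 afterwards, the eager version still multiplies by 0, while the deferred version multiplies by 2. Calling get() inside the pipeline definition behaves like the eager case.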

Since a Param is not convertible to bool, the Halide way of writing an if is Halide::select().
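
Halide::select(condition, a, b) chooses between two expressions when the pipeline is evaluated, much like a ternary operator. As a rough plain C++ sketch of the branch in the question, the functions below compute the automatic saturation from key and pick between it and the user-supplied sat. The function names are hypothetical. The constants come from the question's code, and the branch direction follows its first version, where a true cS meant "derive sat from key".

```cpp
#include <cassert>
#include <cmath>

// Saturation derived from key, using the constants from the question.
inline float auto_saturation(float key) {
    const float k1 = 1.6774f;
    const float k2 = 0.9925f;
    const float p = std::pow(key, k2);
    return (1.0f + k1) * p / (1.0f + k1 * p);
}

// Per-call choice, analogous to select(cS, auto_saturation, sat):
// the condition is examined every time the function runs.
inline float choose_saturation(bool cS, float sat, float key) {
    return cS ? auto_saturation(key) : sat;
}
```

Unlike an if around a Param in the pipeline definition, the choice here depends on the arguments of each call and not on a value read once at construction time.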

You are not using the clamped value returned by Halide::clamp().
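
Halide::clamp() returns a new expression and leaves its argument untouched, so the result has to be assigned, as the revised code in the question does with ldr_lum = Halide::clamp(ldr_lum, 0.f, 1.f). The same pitfall exists in plain C++ with std::clamp (C++17), which is used below purely as a stand-in. The two function names are made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative only: std::clamp stands in for Halide::clamp here.

// The clamped result is thrown away, so v is returned unchanged.
inline float normalize_discarding_result(float v) {
    (void)std::clamp(v, 0.0f, 1.0f);
    return v;
}

// The clamped result is assigned back, so the limits take effect.
inline float normalize(float v) {
    v = std::clamp(v, 0.0f, 1.0f);
    return v;
}
```

For an input of 2.5, the first function returns 2.5 and the second returns 1.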

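I don't see cS being used by the Halide code, only by the C code.

Now for your JIT problem. It looks like you started out with AOT compilation and then switched to JIT.

You create a std::vector<Halide::Argument> but never pass it anywhere. How would Halide know which Params you want to use? It looks at the Func and finds the references to ImageParam and Param objects.

How do you know in what order it expects the Params? You have no control over this. I was able to dump the bitcode by defining HL_GENBITCODE=1 and then view it with llvm-dis to see your function: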

int gamma
    ( buffer_t *img
    , float clampMax
    , float key
    , float clampMin
    , float sat
    , void *user_context
    , buffer_t *result
    );
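
Because all four scalar parameters are floats, passing them in the wrong order compiles without complaint and silently swaps their meaning. A small plain C++ sketch of that effect follows. The Received struct and both functions are hypothetical. The first function only mirrors the scalar parameter order of the dumped signature above, and the second uses the order assumed by the typedef in the question.

```cpp
#include <cassert>

// What the callee believes it was given.
struct Received {
    float clampMax;
    float key;
    float clampMin;
    float sat;
};

// Scalar parameter order as in the dumped signature: clampMax, key, clampMin, sat.
inline Received generated_order(float clampMax, float key, float clampMin, float sat) {
    return {clampMax, key, clampMin, sat};
}

// Caller that assumes the order key, sat, clampMax, clampMin and forwards
// the arguments positionally, as a call through a raw function pointer would.
inline Received call_with_assumed_order(float key, float sat, float clampMax, float clampMin) {
    return generated_order(key, sat, clampMax, clampMin);
}
```

Calling call_with_assumed_order(0.18f, 0.6f, 1.0f, 0.0f) makes the callee see clampMax = 0.18 and clampMin = 1.0, so the clamp range is inverted without any compiler diagnostic. The real mismatch in the question is worse, since the assumed signature also has a bool where the generated one has a void* user_context.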

Use gamma.realize(Halide::Buffer(Halide::type_of<float>(), &output_buf)) instead of calling gamma.compile_jit() and trying to call the generated function correctly.

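For a single use:

Use Image instead of ImageParam.
Use Expr instead of Param.

For repeated use of a single JIT compilation:

Keep the ImageParam and Param objects around and set them before realizing the Func.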

Regarding c++ - Halide JIT compilation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39085251/
