halide - Why is my performance bad? (Newbie scheduling)

Reposted by 行者123. Updated: 2023-12-02 08:22:27

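I mostly program at a very high level, so thinking about things like CPU locality is very new to me.

I'm working on a basic bilinear demosaic (for RGGB sensor data). My algorithm is correct (judging by the results), but it doesn't perform as well as I had hoped (~210Mpix/s).

Here is my code (the input is a 4640x3472 image with a single channel of RGGB data):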

def get_bilinear_debayer(input_raw, print_nest=False):
    x, y, c = Var(), Var(), Var()

    # Clamp and move to 32 bit for lots of space for averaging.
    input = Func()
    input[x,y] = cast(
        UInt(32),
        input_raw[
            clamp(x,0,input_raw.width()-1),
            clamp(y,0,input_raw.height()-1)]
    )

    # Interpolate vertically
    vertical = Func()
    vertical[x,y] = (input[x,y-1] + input[x,y+1])/2

    # Interpolate horizontally
    horizontal = Func()
    horizontal[x,y] = (input[x-1,y] + input[x+1,y])/2

    # Interpolate on diagonals
    diagonal_average = Func()
    diagonal_average[x, y] = (
        input[x+1,y-1] + 
        input[x+1,y+1] +
        input[x-1,y-1] +
        input[x-1,y+1])/4

    # Interpolate on adjacents
    adjacent_average = Func()
    adjacent_average[x, y] = (horizontal[x,y] + vertical[x,y])/2

    red, green, blue = Func(), Func(), Func()

    # Calculate the red channel
    red[x, y, c] = select(
        # Red photosite
        c == 0, input[x, y],
        # Green photosite
        c == 1, select(x%2 == 0, vertical[x,y],
                                 horizontal[x,y]),
        # Blue photosite
        diagonal_average[x,y]
    )

    # Calculate the blue channel
    blue[x, y, c] = select(
        # Blue photosite
        c == 2, input[x, y],
        # Green photosite
        c == 1, select(x%2 == 1, vertical[x,y],
                                 horizontal[x,y]),
        # Red photosite
        diagonal_average[x,y]
    )

    # Calculate the green channel
    green[x, y, c] = select(
        # Green photosite
        c == 1, input[x,y],
        # Red/Blue photosite
        adjacent_average[x,y]
    )

    # Switch color interpolator based on requested color.
    # Specify photosite as third argument, calculated as [x, y, z] = (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 2)
    # Happily works out to a sum of x mod 2 and y mod 2.
    debayer = Func()
    debayer[x, y, c] = select(c == 0, red[x, y, x%2 + y%2],
                              c == 1, green[x, y, x%2 + y%2],
                                      blue[x, y, x%2 + y%2])


    # Scheduling
    x_outer, y_outer, x_inner, y_inner, tile_index = Var(), Var(), Var(), Var(), Var()

    bits = input_raw.get().type().bits

    output = Func()
    # Cast back to the original colour space
    output[x,y,c] = cast(UInt(bits), debayer[x,y,c])
    # Reorder so that colours are calculated in order (red runs, then green, then blue)

    output.reorder_storage(c, x, y)
    # Tile in 128x128 squares
    output.tile(x, y, x_outer, y_outer, x_inner, y_inner, 128, 128)
    # Vectorize based on colour
    output.bound(c, 0, 3)
    output.vectorize(c)
    # Fuse and parallelize
    output.fuse(x_outer, y_outer, tile_index)
    output.parallel(tile_index)

    # Debugging
    if print_nest:
        output.print_loop_nest()
        debayer.print_loop_nest()
        red.print_loop_nest()
        green.print_loop_nest()
        blue.print_loop_nest()

    return output
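
As a small sanity check on the photosite index used in `debayer` above, the sketch below confirms in plain Python (no Halide) that `x%2 + y%2` maps an RGGB mosaic to 0 for red, 1 for green and 2 for blue. The `photosite_class` helper and the `PATTERN` table are illustrative names, not part of the original code.

```python
# Plain-Python check of the photosite index from the question's comment:
# (x, y) -> x%2 + y%2 gives 0 (red), 1 (green), 2 (blue) on an RGGB mosaic.

def photosite_class(x, y):
    """Hypothetical helper: 0 = red site, 1 = green site, 2 = blue site."""
    return x % 2 + y % 2

# RGGB layout, indexed as PATTERN[y % 2][x % 2].
PATTERN = [["R", "G"],
           ["G", "B"]]
NAMES = "RGB"

for y in range(4):
    for x in range(4):
        expected = PATTERN[y % 2][x % 2]
        actual = NAMES[photosite_class(x, y)]
        assert actual == expected, (x, y, actual, expected)
```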

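Honestly, I have no idea what I'm doing here, and I'm too new to this to know where to look or what to look at.

Any advice on how to improve the schedule would be helpful. I'm still learning, but feedback is hard to find.

The schedule I have is the best I've been able to do, but it's almost entirely trial and error.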

Edit: I gained an extra 30Mpix/s by doing the whole adjacent-average sum directly in the function and by vectorizing over x_inner instead of colour.
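
One thing to keep in mind with that change: the edited code is not shown, but if it sums the four neighbours and divides once (an assumption), integer division can give a result that differs by one from the original two-stage average. The sketch below shows this in plain Python, using `//` to stand in for integer division on unsigned values; the function names are illustrative only.

```python
from itertools import product

def nested_average(left, right, up, down):
    # Mirrors adjacent_average = (horizontal + vertical) / 2 in the question.
    horizontal = (left + right) // 2
    vertical = (up + down) // 2
    return (horizontal + vertical) // 2

def direct_average(left, right, up, down):
    # Assumed form of the edited version: one sum, one division.
    return (left + right + up + down) // 4

# A case where the two disagree because of intermediate rounding.
print(nested_average(1, 0, 1, 2), direct_average(1, 0, 1, 2))

# Over a small exhaustive range, the direct form is never smaller and
# never more than 1 larger than the nested form.
max_gap = max(
    direct_average(*v) - nested_average(*v) for v in product(range(8), repeat=4)
)
min_gap = min(
    direct_average(*v) - nested_average(*v) for v in product(range(8), repeat=4)
)
```

So the faster version is slightly more accurate rather than bit-identical, which matters if the output is compared against a reference.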

Edit: new schedule:

# Set input bounds.
output.bound(x, 0, (input_raw.width()/2)*2)
output.bound(y, 0, (input_raw.height()/2)*2)
output.bound(c, 0, 3)
# Reorder so that colours are calculated in order (red runs, then green, then blue)
output.reorder_storage(c, x, y)
output.reorder(c, x, y)
# Tile in 128x128 squares
output.tile(x, y, x_outer, y_outer, x_inner, y_inner, 128, 128)
output.unroll(x_inner, 2).unroll(y_inner,2)
# Vectorize based on colour
output.unroll(c)
output.vectorize(c)
# Fuse and parallelize
output.fuse(x_outer, y_outer, tile_index)
output.parallel(tile_index)

Edit: the final schedule now beats (at 640MP/s) the Intel Performance Primitive benchmark, which was run on a CPU twice as powerful as mine:

output = Func()
# Cast back to the original colour space
output[x,y,c] = cast(UInt(bits), debayer[x,y,c])
# Set input bounds.
output.bound(x, 0, (input_raw.width()/2)*2)
output.bound(y, 0, (input_raw.height()/2)*2)
output.bound(c, 0, 3)
# Tile in 128x128 squares
output.tile(x, y, x_outer, y_outer, x_inner, y_inner, 128, 128)
output.unroll(x_inner, 2).unroll(y_inner, 2)
# Vectorize based on colour
output.vectorize(x_inner, 16)
# Fuse and parallelize
output.fuse(x_outer, y_outer, tile_index)
output.parallel(tile_index)

target = Target()
target.arch = X86
target.os = OSX
target.bits = 64
target.set_feature(AVX)
target.set_feature(AVX2)
target.set_feature(SSE41)
output.compile_jit(target)
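
For readers new to scheduling, the tile, fuse and parallel steps can be pictured as the loop nest below. This is a plain-Python sketch, not Halide output: it assumes the tile size divides the image evenly (Halide handles leftover edges itself, which is not modelled here), and `tiled_fused_order` is a hypothetical name. In `fuse(x_outer, y_outer, tile_index)` the first variable is the inner one, so `x_outer` varies fastest within `tile_index`.

```python
def tiled_fused_order(width, height, tile):
    """Yield (x, y) in the order a tile + fuse schedule would visit them."""
    assert width % tile == 0 and height % tile == 0  # simplifying assumption
    tiles_x = width // tile
    tiles_y = height // tile
    # This outermost loop is the one marked parallel in the schedule;
    # each tile_index is an independent unit of work.
    for tile_index in range(tiles_x * tiles_y):
        x_outer = tile_index % tiles_x   # inner variable of the fuse
        y_outer = tile_index // tiles_x  # outer variable of the fuse
        for y_inner in range(tile):
            for x_inner in range(tile):
                yield (x_outer * tile + x_inner, y_outer * tile + y_inner)

visited = list(tiled_fused_order(8, 4, 2))
```

Every pixel is visited exactly once, and each tile touches only a small square of the input, which is what gives the locality benefit.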

Best answer

Make sure you use unroll(c) so that the per-channel select logic is optimized away. Unrolling by 2 in x and y will also help:

output.unroll(x, 2).unroll(y,2)
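
To see why unrolling by 2 helps, here is a plain-Python sketch (not Halide, and with made-up function names) of one row of the red channel at green photosites. In the first version the parity of x is tested for every pixel; in the unrolled version each copy of the loop body has a known parity, so the test disappears. The equivalence relies on the row starting at an even x and having an even length.

```python
def row_with_select(vertical, horizontal):
    # One loop body: x % 2 is evaluated for every pixel.
    return [
        vertical[x] if x % 2 == 0 else horizontal[x]
        for x in range(len(vertical))
    ]

def row_unrolled_by_2(vertical, horizontal):
    n = len(vertical)
    assert n % 2 == 0  # min of 0 and even extent, as assumed above
    out = [0] * n
    for x_outer in range(n // 2):
        # x = 2 * x_outer is always even: no select needed.
        out[2 * x_outer] = vertical[2 * x_outer]
        # x = 2 * x_outer + 1 is always odd: no select needed.
        out[2 * x_outer + 1] = horizontal[2 * x_outer + 1]
    return out

vertical = [10, 11, 12, 13, 14, 15]
horizontal = [20, 21, 22, 23, 24, 25]
same = row_with_select(vertical, horizontal) == row_unrolled_by_2(vertical, horizontal)
```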

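The goal is to optimize away the select logic between even and odd rows and columns. To take full advantage of this, you will probably also need to tell Halide that the min and extent are multiples of 2: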

output.output_buffer().set_bounds(0, (output.output_buffer().min(0) / 2) * 2, (output.output_buffer().extent(0) / 2) * 2)
output.output_buffer().set_bounds(1, (output.output_buffer().min(1) / 2) * 2, (output.output_buffer().extent(1) / 2) * 2)

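It may be worth stating even stronger constraints, though, such as using 128 instead of 2 to assert a multiple of the tile size or, if only a single camera is supported, simply hard-wiring the min and extent to reflect the actual sensor parameters.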
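
As a quick worked check of what those constraints mean for the 4640x3472 sensor from the question, the sketch below rounds each dimension down to a multiple of 2 and of 128. It is plain Python with `//` standing in for the integer division in `(n / 2) * 2`; `round_down` is an illustrative helper, not a Halide call.

```python
WIDTH, HEIGHT = 4640, 3472  # sensor size stated in the question

def round_down(n, multiple):
    """Largest multiple of `multiple` that is <= n."""
    return (n // multiple) * multiple

for multiple in (2, 128):
    print(multiple, round_down(WIDTH, multiple), round_down(HEIGHT, multiple))
# 2 4640 3472
# 128 4608 3456
```

Both dimensions are already even, so the multiple-of-2 constraint costs nothing here. Neither is a multiple of 128 (32 columns and 16 rows are left over), so asserting a multiple of the tile size would mean padding or cropping the image for this particular sensor.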

Regarding "halide - Why is my performance bad? (Newbie scheduling)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33314554/
