glsl - Why does barrier() synchronize shared memory when memoryBarrier() doesn't?


Reposted by: 行者123. Updated: 2023-12-03 23:26:29


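The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.

In the first few lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.

Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:

[Image: unsynchronized result]

This is the same result I get with no synchronization at all, or with memoryBarrier() instead.

If I use barrier(), I get the following (desired) result:

[Image: desired result]

The stripes are 32 pixels wide, and if I change the work group size to anything less than or equal to 32, I get correct results.

What is going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why does barrier() work?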

#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly  image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x,0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i,0));
        }
    }

    // with no synchronization:   stripes
    // memoryBarrier();        // stripes
    // memoryBarrierShared();  // stripes
    // barrier();              // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}

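Best answer

The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which have no dedicated output and only produce results by writing to writable storage such as images, storage buffers, or atomic counters. This may require manual synchronization between individual passes, because otherwise a fragment shader that accesses a texture might not see the most recent data written into that texture by image store operations in a preceding pass, such as your compute shader.

So your compute shader may work perfectly, and what fails is the synchronization with the following display (or other) pass that needs to read this image data. The glMemoryBarrier function exists for this purpose. Depending on how you read the image data in the display pass (or, more precisely, in the pass that reads the image after the compute shader pass), you need to pass a different flag to this function. If you read it through a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if you use glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...

That said, I don't have much experience with image load/store and manual memory synchronization, and this is only what I came up with in theory. So if anyone knows better, or if you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if it is one at all). But the last two points from the linked Wiki article address your use case and, IMHO, make it clear that you need some kind of glMemoryBarrier: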

Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.

Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.

EDIT: Actually, the Wiki article on compute shaders says:

Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.

The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.

While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reach this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.

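So it sounds like you do need barrier() there, and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence says). A memory barrier only synchronizes memory; it does not stop threads from executing past it. So threads will not read stale cached data from shared memory once the first thread has written something, but they can very well reach the read before the first thread has written anything at all.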
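As a minimal sketch of that conclusion, here is the shader from the question with the synchronization point filled in. The barrier() call is the one the question reports as working. The memoryBarrierShared() call placed before it is an assumption on my part: it is a conservative addition that some code uses, and the answer above suggests it can be dropped.

```glsl
#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    // Invocation 0 fills the whole shared array, as in the question.
    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // Assumption: conservative memory barrier for shared variables;
    // the answer above suggests barrier() alone is sufficient here.
    memoryBarrierShared();

    // Execution barrier: no invocation in the work group continues until
    // all of them have arrived, so invocation 0 has finished its loop.
    barrier();

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
```

Note that barrier() sits outside the if block, so every invocation in the work group reaches it; calling it from only some invocations is not allowed.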

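This explanation also fits the observation that it works for block sizes of 32 and below, and that the first 32 pixels come out right. At least on NVIDIA hardware, 32 is the warp size, which is the number of threads that operate in perfect lock-step. So the first 32 threads (or rather, every block of 32 threads) always run exactly in parallel (conceptually, at least) and therefore cannot introduce any race conditions among themselves. This is also why you don't need any synchronization when you know you are working inside a single warp, which is a common optimization.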
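The race does not depend on one invocation doing all the loading. The following hypothetical variant (not from the question) has every invocation load its own pixel and then read a slot written by a different invocation. The mirrored index `other` is an assumption chosen purely for illustration; the result is still a plain copy, but with 64 invocations the slot being read may belong to another warp, so the barrier is still required.

```glsl
#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    uint x = gl_LocalInvocationID.x;

    // Each invocation loads exactly one pixel into its own slot.
    shared_data[x] = imageLoad(inImage, base + ivec2(x, 0));

    // Wait until every invocation in the work group has written its slot.
    memoryBarrierShared();  // assumption: conservative, as discussed above
    barrier();

    // Hypothetical cross-invocation read: the mirrored slot was written
    // by a different invocation, possibly in a different warp.
    uint other = uint(SIZE) - 1u - x;
    imageStore(outImage, base + ivec2(other, 0), shared_data[other]);
}
```

Without the barrier() call, invocations 0 to 31 would read slots written by invocations 32 to 63 and vice versa, which is exactly the cross-warp situation described above.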

A corresponding question on Stack Overflow: https://stackoverflow.com/questions/17430443/
