
c# - Pointers and bitwise operators in a GPU kernel

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:41:39

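I want to apply a double threshold to a volume using a GPU kernel. I send the volume to the kernel slice by slice, each slice as a read-only image2d_t. My output volume is a binary volume in which each bit specifies whether its associated voxel is enabled or disabled. My kernel checks whether the current pixel value lies between the lower and upper thresholds and, if so, enables the corresponding bit in the binary volume.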

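For debugging purposes I have temporarily commented out the actual check. I simply use the slice number passed in to decide whether the binary volume bit should be on or off: slices with slicenr < 15 are set to "on" and the rest to "off". I have also verified this code on the CPU side; the code is pasted at the bottom of this post. It shows both paths, with the CPU path currently commented out.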

The CPU code works as expected and, after rendering the volume with the binary mask applied, returns the following image:

[Image: Rendering with a correct computed mask]

Running exactly the same logic with my GPU kernel returns incorrect results (first the 3D view, then the slice view):

[Image: Rendering with an incorrect GPU computed mask]

[Image: Rendering with an incorrect GPU computed mask (sliceview)]

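What is going wrong here? I have read that OpenCL does not support bit fields, but as far as I understand from the OpenCL specification it does support bitwise operators. So my bit logic, which selects the correct bit from a 32-bit word and flips it, should be supported, right? Or is my simple flag considered a bit field? What it does is select bit (x % 32) of the voxel's word, counted from the left (not from the right, hence the subtraction).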
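The bit selection described above can be sketched in plain C. This is only an illustration: the helper names `voxel_mask` and `set_voxel` are hypothetical and do not appear in the original code. It mirrors the mask expression from the post, except that it shifts an unsigned 1, because shifting a signed 1 into bit 31 is undefined behavior in C.

```c
#include <assert.h>
#include <stdint.h>

enum { BINARY_WORD_BITS = 32 };

/* Mask for voxel x inside its 32-bit word, counting bits from the
   left (most significant bit first). Hypothetical helper name. */
static uint32_t voxel_mask(uint32_t x)
{
    uint32_t bit_from_left = x & 0x1fu;   /* same as x % 32 */
    return UINT32_C(1) << ((BINARY_WORD_BITS - 1) - bit_from_left);
}

/* Turn the bit for voxel x on or off in the word that holds it.
   Note that both branches read the word before writing it back. */
static void set_voxel(uint32_t *word, uint32_t x, int on)
{
    assert(word != 0);
    uint32_t mask = voxel_mask(x);
    if (on) {
        *word |= mask;
    } else {
        *word &= ~mask;
    }
}
```

With this layout, voxel 0 of a row maps to the highest bit of its word (`0x80000000`) and voxel 31 to the lowest (`0x00000001`). Plain bitwise operators like these are ordinary integer operations, not bit fields.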

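Another possibility is that the uint pointer passed to my kernel is not what I expect. I assume this is a valid use of pointers for passing data to my kernel. The logic applied to the "uint* word" part in the kernel accounts for the padding: each row is padded to a whole number of words and each slice to a whole number of rows. The CPU variant confirms that the pointer arithmetic is valid.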
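A rough C sketch of that word addressing follows. The post does not show `BinaryVolumeWordsPerRow`, so `words_per_row` here is an assumed stand-in that pads each row up to whole 32-bit words, and `header_words` is a hypothetical parameter that models the unexplained `+ 1` in the post's pointer expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a row is padded up to a whole number of 32-bit words. */
static uint32_t words_per_row(uint32_t width)
{
    return (width + 31u) >> 5;
}

/* Index of the word that holds voxel (x, y, z).
   wpr = words per row, wpp = words per plane (slice).
   header_words models the "+ 1" offset used in the post. */
static size_t word_index(uint32_t x, uint32_t y, uint32_t z,
                         uint32_t wpr, uint32_t wpp, uint32_t header_words)
{
    assert(wpr > 0 && wpp >= wpr);
    assert((x >> 5) < wpr);
    return (size_t)header_words
         + (size_t)z * wpp
         + (size_t)y * wpr
         + (x >> 5);          /* 32 voxels share one word */
}
```

The last term is the important one for the GPU case: 32 neighbouring voxels in a row resolve to the same index, so 32 different work-items end up updating the same word.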

The code is below:

uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);

int[] dims = new int[3];
dims[0] = volume.Geometry.NumberOfVoxels.X;
dims[1] = volume.Geometry.NumberOfVoxels.Y;
dims[2] = volume.Geometry.NumberOfVoxels.Z;

uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
unsafe {
    fixed (int* dimPtr = dims) {
        fixed (uint* arrcPtr = arrC) {
            // pick Cloo Platform
            ComputePlatform platform = ComputePlatform.Platforms[0];

            // create context with all gpu devices
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                new ComputeContextPropertyList(platform), null, IntPtr.Zero);

            // load opencl source
            StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
            string clSource = streamReader.ReadToEnd();
            streamReader.Close();

            // create program with opencl source
            ComputeProgram program = new ComputeProgram(context, clSource);

            // compile opencl source
            program.Build(null, null, null, IntPtr.Zero);

            // Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
            // Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
            // For this reason their use should be avoided if possible.
            ComputeEventList eventList = new ComputeEventList();

            // Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
            ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

            // Create the kernel function and set its arguments.
            ComputeKernel kernel = program.CreateKernel("LowerThreshold");

            int slicenr = 0;
            foreach (IntPtr ptr in pinnedSlices) {
                /*// CPU VARIANT FOR TESTING PURPOSES
                for (int y = 0; y < dims[1]; y++) {
                    for (int x = 0; x < dims[0]; x++) {
                        long pixelOffset = x + y * dims[0];
                        ushort* ushortPtr = (ushort*)ptr;
                        ushort pixel = *(ushortPtr + pixelOffset);

                        int BinaryWordShift = 5;
                        int BinaryWordBits = 32;
                        if (
                            (0 <= x) &&
                            (0 <= y) &&
                            (0 <= slicenr) &&
                            (x < dims[0]) &&
                            (y < dims[1]) &&
                            (slicenr < dims[2])
                        ) {
                            uint* word =
                                arrcPtr + 1 + (slicenr * wordsPerPlane) +
                                (y * wordsPerRow) +
                                (x >> BinaryWordShift);
                            uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
                            //if (pixel > lowerThreshold && pixel < upperThreshold) {
                            if (slicenr < 15) {
                                *word |= mask;
                            } else {
                                *word &= ~mask;
                            }
                        }
                    }
                }*/

                ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
                    context,
                    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
                    3,
                    new IntPtr(dimPtr));

                ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
                ComputeImage2D image2D = new ComputeImage2D(
                    context,
                    ComputeMemoryFlags.ReadOnly,
                    format,
                    volume.Geometry.NumberOfVoxels.X,
                    volume.Geometry.NumberOfVoxels.Y,
                    0,
                    ptr
                );

                // The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
                ComputeBuffer<uint> c = new ComputeBuffer<uint>(
                    context, ComputeMemoryFlags.WriteOnly, arrC.Length);

                kernel.SetMemoryArgument(0, image2D);
                kernel.SetMemoryArgument(1, dimsBuffer);
                kernel.SetValueArgument(2, wordsPerRow);
                kernel.SetValueArgument(3, wordsPerPlane);
                kernel.SetValueArgument(4, slicenr);
                kernel.SetValueArgument(5, lowerThreshold);
                kernel.SetValueArgument(6, upperThreshold);
                kernel.SetMemoryArgument(7, c);

                // Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
                // If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
                commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);

                // Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer
                // will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete
                // by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
                // eventList will contain two events after this method returns.
                commands.ReadFromBuffer(c, ref arrC, false, eventList);

                // A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
                // in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands
                // to finish has to be issued before "arrC" can be used.
                // This explicit synchronization can be achieved in two ways:
                // 1) Wait for the events in the list to finish,
                //eventList.Wait();
                //}
                // 2) Or simply use
                commands.Finish();

                slicenr++;
            }
        }
    }
}

And my kernel code:

const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE;

kernel void LowerThreshold(
    read_only image2d_t image,
    global int* brickSize,
    uint wordsPerRow,
    uint wordsPerPlane,
    int slicenr,
    int lower,
    int upper,
    global write_only uint* c )
{
    int4 coord = (int4)(get_global_id(0), get_global_id(1), slicenr, 1);
    uint4 pixel = read_imageui(image, smp, coord.xy);

    uchar BinaryWordShift = 5;
    int BinaryWordBits = 32;
    if (
        (0 <= coord.x) &&
        (0 <= coord.y) &&
        (0 <= coord.z) &&
        (coord.x < brickSize[0]) &&
        (coord.y < brickSize[1]) &&
        (coord.z < brickSize[2])
    ) {
        global uint* word =
            c + 1 + (coord.z * wordsPerPlane) +
            (coord.y * wordsPerRow) +
            (coord.x >> BinaryWordShift);

        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
        //if (pixel.w > lower && pixel.w < upper) {
        if (slicenr < 15) {
            *word |= mask;
        } else {
            *word &= ~mask;
        }
    }
}

Best answer

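Two problems:

  1. You have declared "c" as "write_only", but you use the "|=" and "&=" operators, which are read-modify-write operations.

  2. As other posters have mentioned, if two work items access the same word, their read-modify-write sequences race with each other, which produces wrong results. Atomic operations are much slower than non-atomic ones, so although they would work, they are not recommended.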
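The lost update in point 2 can be reproduced deterministically on the CPU. The sketch below is not OpenCL code: it is plain C that hard-codes one possible interleaving of two work-items executing `*word |= mask` on the same word, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two work-items A and B both run "*word |= mask" on the same word.
   One possible interleaving: A reads, B reads, A writes, B writes. */
static uint32_t racy_or(uint32_t word, uint32_t mask_a, uint32_t mask_b)
{
    assert((mask_a & mask_b) == 0);   /* the two voxels use different bits */
    uint32_t seen_by_a = word;        /* A reads the word */
    uint32_t seen_by_b = word;        /* B reads the same, soon stale, value */
    word = seen_by_a | mask_a;        /* A writes its bit */
    word = seen_by_b | mask_b;        /* B writes and wipes out A's bit */
    return word;
}

/* The same two updates applied strictly one after the other, which is
   the ordering an atomic read-modify-write would enforce. */
static uint32_t serialized_or(uint32_t word, uint32_t mask_a, uint32_t mask_b)
{
    assert((mask_a & mask_b) == 0);
    word |= mask_a;
    word |= mask_b;
    return word;
}
```

Starting from an empty word, the racy interleaving keeps only B's bit, while the serialized version keeps both. In an OpenCL kernel the serialized behavior would come from the built-in `atomic_or` and `atomic_and` functions, at the performance cost the answer mentions.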

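I suggest making the output 8 times larger and using bytes instead of bits. That makes your output write-only and also removes the contention, and with it the race condition.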
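A minimal C model of the byte-per-voxel suggestion, assuming an unpadded row-major slice of 16-bit pixels; the function name and parameter layout are hypothetical and not taken from the post. Each call stands in for one work-item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Body of one work-item at (x, y). It stores exactly one byte that no
   other work-item touches, and it never reads the output buffer. */
static void threshold_voxel_to_byte(const uint16_t *slice, uint8_t *out,
                                    uint32_t width, uint32_t height,
                                    uint32_t x, uint32_t y,
                                    uint16_t lower, uint16_t upper)
{
    assert(x < width && y < height);
    size_t i = (size_t)y * width + x;
    uint16_t pixel = slice[i];
    out[i] = (uint8_t)(pixel > lower && pixel < upper);   /* 1 = on, 0 = off */
}
```

Because every element is assigned rather than OR-ed into place, the output does not need to be initialized and the order in which work-items run no longer matters.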

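Alternatively (if data compactness or format matters), have each work item process 8 elements at a time and write the combined 8-bit output as a single byte. That is write-only, has no contention, and still keeps your data compact.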
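The packed variant can be sketched the same way. In this hypothetical C model, one call represents one work-item that thresholds 8 consecutive voxels of a row and returns the byte it would store; the first voxel goes into the highest bit, matching the left-to-right bit order used in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Body of one work-item: threshold voxels [8*group, 8*group + 8) of a
   row and pack the results into one byte, first voxel in the top bit.
   Positions past the end of the row are padding and stay 0. */
static uint8_t pack_eight(const uint16_t *row, uint32_t width, uint32_t group,
                          uint16_t lower, uint16_t upper)
{
    assert(row != 0);
    uint8_t packed = 0;
    for (uint32_t bit = 0; bit < 8; ++bit) {
        uint32_t x = group * 8u + bit;
        if (x < width && row[x] > lower && row[x] < upper) {
            packed |= (uint8_t)(0x80u >> bit);
        }
    }
    return packed;   /* the caller stores this byte with a plain assignment */
}
```

The byte is built in a private variable and stored once, so the output buffer is never read. If the 32-bit word layout from the question has to be preserved exactly, the same idea with 32 voxels per work-item and one whole `uint` store avoids having to think about byte order inside a word.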

Regarding c# - Pointers and bitwise operators in a GPU kernel, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16786685/
