I am trying to modify Intel's Bitonic Sorting algorithm, which sorts an array of cl_int, so that it sorts an array of cl_int2 by key (that is, by cl_int2.x).
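Intel's example consists of simple host code and a single OpenCL kernel that is called multiple times during one sort operation (multipass). The kernel loads 4 array items at a time as a cl_int4 and operates on them.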
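I did not modify the host code algorithm, only the device code. List of changes in the kernel function:
- changed the kernel argument type from int4* to int8* (to load four key-value pairs)
- use only the .even components of the theArray elements to compare values (<)
- create a "pseudomask" (int4) and build mask from it as pseudomask.xxyyzzww (so that the values are captured as well)
Although the output of my modified kernel is a cl_int2 array sorted perfectly by the first component (cl_int2.x), the values (cl_int2.y) are incorrect: the value of one item is repeated in the next 4 or 8 items, then a new value is used and repeated, and so on.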
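I am sure there is a trivial mistake, but I cannot find it.
Diff of the original Intel code and my modified version.
When every key (cl_int2.x) is unique, the cl_int2 array is sorted perfectly.
Example input: http://pastebin.com/92qB1csT
Example output: http://pastebin.com/dsU97Npn
(correctly sorted array: http://pastebin.com/Nb56BuQK)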
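The symptom described above (keys in order, values repeated) can be told apart from a plain ordering bug with a small host-side check. The following is only a sketch: `pair_t`, `sorted_by_key` and `same_pairs` are hypothetical names that are not part of the Intel sample, and `pair_t` merely stands in for cl_int2 with x as the key and y as the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for cl_int2: x is the key, y is the value. */
typedef struct { int key; int value; } pair_t;

/* Returns true when the keys are monotonic in the requested direction. */
static bool sorted_by_key(const pair_t *a, size_t n, bool ascending)
{
    for (size_t k = 1; k < n; k++) {
        if (ascending ? a[k - 1].key > a[k].key : a[k - 1].key < a[k].key)
            return false;
    }
    return true;
}

/* Total order on pairs (key first, then value), used only for comparing multisets. */
static int cmp_pair(const void *pa, const void *pb)
{
    const pair_t *a = pa, *b = pb;
    if (a->key != b->key)
        return (a->key > b->key) - (a->key < b->key);
    return (a->value > b->value) - (a->value < b->value);
}

/* Returns true when `out` holds exactly the same key-value pairs as `in`,
 * in any order. Duplicated or lost values make this return false. */
static bool same_pairs(const pair_t *in, const pair_t *out, size_t n)
{
    assert(n > 0);
    pair_t *a = malloc(n * sizeof *a);
    pair_t *b = malloc(n * sizeof *b);
    bool same = (a != NULL && b != NULL);
    if (same) {
        memcpy(a, in, n * sizeof *a);
        memcpy(b, out, n * sizeof *b);
        qsort(a, n, sizeof *a, cmp_pair);
        qsort(b, n, sizeof *b, cmp_pair);
        for (size_t k = 0; k < n && same; k++)
            same = (a[k].key == b[k].key && a[k].value == b[k].value);
    }
    free(a);
    free(b);
    return same;
}
```

An output that passes `sorted_by_key` but fails `same_pairs` has lost or duplicated pairs rather than being misordered.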
Modified kernel code (with comments):
// Copyright (c) 2009-2011 Intel Corporation
// https://software.intel.com/en-us/articles/bitonic-sorting
// Modified to sort int2 key-value array
__kernel void BitonicSort(__global int8* theArray,
                          const uint stage,
                          const uint passOfStage,
                          const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0, 0, -1, -1);
    int4 imask11 = (int4)(0, -1, 0, -1);
    if(stage > 0)
    {
        if(passOfStage > 0) // upper level pass, exchange between two fours
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;
            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww;
            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);
            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left] = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left] = imax;
            }
        }
        else // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyzzww;
            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;
                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;
                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else // first stage, sort inside one four
    {
        /*
         * To convert this code to int2 sorter, do this:
         * 1. instead of loading int4, load int8 (key,value, key,value, ...)
         * 2. when there is a vector swizzling, replace component index with two consecutive indices:
         *    srcLeft.yxwz -> srcLeft.s23016745
         *    use this rewrite rule:
         *      x  y  z  w
         *      01 23 45 67
         * 3. replace comparison operands with only their keys swizzled:
         *    mask = srcLeft < srcRight; -> pseudomask = srcLeft.even < srcRight.even; mask = pseudomask.xxyyzzww;
         */
        // make bitonic sequence out of 4.
        int4 imask0 = (int4)(0, -1, -1, 0); // -1 in comparison = true (all bits set - two's complement)
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;
        /*
         * This XOR mask flips bits, so that in `mask` are the following
         * results (remember that srcRight is srcLeft with swapped component pairs):
         *
         *     [ left.x<left.y, left.x<left.y, left.w<left.z, left.w<left.z ]
         * or: [ left.x<left.y, left.x<left.y, left.z>left.w, left.z>left.w ]
         */
        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxyyzzww;
        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask); // make sure the numbers are sorted like this:
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
        /*
         * Now the pairs of numbers in `srcLeft` are sorted according to the specified `dir`ection.
         * If dir == true, then
         *     The components `x` and `y` are swapped so that `x` < `y`. Moreover `z` and `w` are swapped so that `z` > `w`. This resembles up-hill: /\
         * else
         *     The components `x` and `y` are swapped so that `x` > `y`. Moreover `z` and `w` are swapped so that `z` < `w`. This resembles down-hill: \/
         *
         * This swapping is achieved by creating `srcLeft`, which is in normal order, and `srcRight`, which has component pairs switched (xyzw -> yxwz).
         * Then the `mask` is created. The mask bits are redundant because it applies to vector component pairs (so in order to implement key-value sorting,
         * I have to increase the length of masks!).
         *
         * The non-ordered component pairs in `srcLeft` are masked out by `mask` while the inverted `mask` is applied to the (pair-wise switched) `srcRight`.
         *
         * This (the previous) first flipping just makes a 4-bitonic sequence.
         */
        /*
         * This second step just sorts the bitonic sequence
         */
        srcRight = srcLeft.s45670123; // inverts the bitonic sequence
        // [ left.a<left.c, left.b<left.d, left.a<left.c, left.b<left.d ]
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10; // imask10 = (noflip, noflip, flip, flip)
        mask = pseudomask.xxyyzzww;
        // even or odd (The output of this thread is sorted monotonic sequence. The monotonicity changes and thus preparing bitonic sequence for the next pass.).
        if((i & 1) ^ dir)
        {
            // this sorts the bitonic sequence, hence splitting it
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;
            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;
            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}
Host-side code:
void ExecuteSortKernel(cl_kernel kernel, cl_command_queue queue, cl_mem cl_input_buffer, cl_int arraySize, cl_uint sortAscending)
{
    cl_int numStages = 0;
    cl_int stage;
    cl_int passOfStage;
    for (cl_int temp = arraySize; temp > 2; temp >>= 1)
        numStages++;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &cl_input_buffer);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), (void *) &sortAscending);
    for (stage = 0; stage < numStages; stage++) {
        clSetKernelArg(kernel, 1, sizeof(cl_uint), (void *) &stage);
        for (passOfStage = stage; passOfStage >= 0; passOfStage--) {
            clSetKernelArg(kernel, 2, sizeof(cl_uint), (void *) &passOfStage);
            // set work-item dimensions
            size_t gsz = arraySize / (2*4);
            size_t global_work_size[1] = { passOfStage ? gsz : gsz << 1 }; // number of quad items in input array
            // execute kernel
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        }
    }
}
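The multipass schedule that the host loop above implies can be restated in plain C without any OpenCL calls. This is only a sketch: the helper names `count_stages`, `count_launches` and `work_items` are hypothetical, and the power-of-two length check is an assumption about what this form of bitonic sort expects, not something the sample enforces.

```c
#include <assert.h>
#include <stddef.h>

/* Number of stages the host loop runs. For a power-of-two arraySize this is
 * log2(arraySize) - 1, because stage 0 already sorts groups of four items. */
static int count_stages(int arraySize)
{
    /* Assumption: power-of-two length of at least 8 items. */
    assert(arraySize >= 8 && (arraySize & (arraySize - 1)) == 0);
    int numStages = 0;
    for (int temp = arraySize; temp > 2; temp >>= 1)
        numStages++;
    return numStages;
}

/* Total number of kernel launches: stage s runs passes s, s-1, ..., 0. */
static int count_launches(int arraySize)
{
    int launches = 0;
    int numStages = count_stages(arraySize);
    for (int stage = 0; stage < numStages; stage++)
        for (int pass = stage; pass >= 0; pass--)
            launches++;
    return launches;
}

/* Work-items per launch, mirroring global_work_size in the host code:
 * pass 0 handles one group of four items per work-item, every other pass
 * handles two groups of four per work-item. */
static size_t work_items(int arraySize, int passOfStage)
{
    size_t gsz = (size_t)arraySize / (2 * 4);
    return passOfStage ? gsz : gsz << 1;
}
```

For 1024 items this gives 9 stages and 45 launches, with 256 work-items on each pass 0 and 128 on every other pass.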
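Best answer
I finally solved the problem!
The tricky part is the way the original Intel code treats equal values in neighbouring pairs of the loaded 4-tuple: it does not handle them explicitly!
The bug was in the very first stage and in the last passOfStage (i.e. passOfStage = 0) of every other stage. These parts of the code swap individual 2-tuples within one 4-tuple (represented by one cl_int8 element of theArray).
Take this excerpt as an example (it does not work properly when neighbouring 2-tuples in the 4-tuple have equal keys):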
imask0 = (int4)(0, -1, -1, 0);
srcLeft = theArray[i]; // int8
srcRight = srcLeft.s23016745;
pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
mask = pseudomask.xxyyzzww;
result = (srcLeft & mask) | (srcRight & ~mask);
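Imagine what happens when this (unfixed) code runs with srcLeft.even = (int4)(7,7, 5,5). The comparison srcLeft.even < srcRight.even yields (int4)(0,0,0,0); XOR-ing that result with imask0 then gives pseudomask = (int4)(0,-1,-1,0), i.e. imask0 itself. That is wrong: after the .xxyyzzww expansion the two neighbouring key-value pairs get opposite mask values, so one of them is written to both positions and the other is lost.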
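The failure can be reproduced on the host by emulating the excerpt lane by lane in plain C. This is a minimal sketch, assuming the `dir != 0` branch of the first stage; `unfixed_step` is a hypothetical helper, and an `int[8]` stands in for one int8 element laid out as k0,v0, k1,v1, k2,v2, k3,v3.

```c
#include <assert.h>

/* Emulates the unfixed excerpt on one int8 element (four key-value pairs). */
static void unfixed_step(const int v[8], int out[8])
{
    static const int swap_pairs[8] = {2, 3, 0, 1, 6, 7, 4, 5}; /* .s23016745 */
    static const int imask0[4] = {0, -1, -1, 0};
    int right[8];
    int pseudomask[4];

    assert(v != out); /* the emulation reads v while writing out */

    /* srcRight = srcLeft.s23016745 */
    for (int c = 0; c < 8; c++)
        right[c] = v[swap_pairs[c]];

    /* pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
     * an OpenCL vector comparison yields -1 (all bits set) for true, 0 for false */
    for (int p = 0; p < 4; p++)
        pseudomask[p] = (v[2 * p] < right[2 * p] ? -1 : 0) ^ imask0[p];

    /* mask = pseudomask.xxyyzzww; result = (srcLeft & mask) | (srcRight & ~mask) */
    for (int c = 0; c < 8; c++) {
        int m = pseudomask[c / 2];
        out[c] = (v[c] & m) | (right[c] & ~m);
    }
}
```

With the keys (7,7, 5,5) and the values (100,200, 300,400), the result holds the values (200,200, 300,300): two of the four values are duplicated and the other two are gone, which matches the repeated values seen in the question.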
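For the swap to work, pseudomask has to follow the pattern (int4)(a,a, b,b), where a and b can each be 0 or -1. This means the following comparison is enough to build a correct mask: quasimask = srcLeft.s06 < srcRight.s06 (components 0 and 6 hold the keys of the first and last pairs, the two lanes that imask0 leaves unflipped). The correct mask is then created as mask = quasimask.xxxxyyyy. The first two x's mask the first key-value pair in the first 2-tuple of the 4-tuple (4-tuple = one element of theArray). Since we want to bitmask the whole corresponding 2-tuple (which imask0 marks as a 0, -1 pair), we add another xx. The second 2-tuple of the 4-tuple is masked in the same way, which gives yyyy.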
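The corrected mask construction can be checked the same way in plain C. This is a sketch under the same assumptions as the excerpt (first stage, `dir != 0`); `fixed_step` is a hypothetical helper, and an `int[8]` stands in for one int8 element laid out as k0,v0, k1,v1, k2,v2, k3,v3.

```c
#include <assert.h>

/* Emulates the first-stage swap with one decision per couple of pairs. */
static void fixed_step(const int v[8], int out[8])
{
    static const int swap_pairs[8] = {2, 3, 0, 1, 6, 7, 4, 5}; /* .s23016745 */
    int right[8];

    assert(v != out); /* the emulation reads v while writing out */

    /* srcRight = srcLeft.s23016745 */
    for (int c = 0; c < 8; c++)
        right[c] = v[swap_pairs[c]];

    /* Only the two lanes that imask0 leaves unflipped are compared:
     * a decides the first two pairs, b decides the last two pairs. */
    int a = v[0] < right[0] ? -1 : 0; /* k0 < k1 */
    int b = v[6] < right[6] ? -1 : 0; /* k3 < k2 */

    /* Equivalent of pseudomask.xxxxwwww: four lanes of a, then four lanes of b */
    for (int c = 0; c < 8; c++) {
        int m = (c < 4) ? a : b;
        out[c] = (v[c] & m) | (right[c] & ~m);
    }
}
```

Because both pairs of a couple now share one decision, they are either both kept or both swapped, so equal keys such as (7,7, 5,5) no longer lose a value.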
Visual example of the bit masking with imask11:
srcLeft:                          x   y   z   w
                                  <   <   <   <
srcRight [relative to srcLeft]:   y   x   w   z
^ imask11:                        0  -1   0  -1
------------------------------------------------
(srcLeft<srcRight)^imask11:       x   x   z   z
Fixed, fully working version (the fixed parts are commented):
__kernel void BitonicSort(__global int8* theArray,
                          const uint stage,
                          const uint passOfStage,
                          const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0, 0, -1, -1);
    int4 imask11 = (int4)(0, -1, 0, -1);
    if(stage > 0)
    {
        if(passOfStage > 0) // upper level pass, exchange between two fours
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;
            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww; // here we interchange individual components, so no mask is applied and hence no 2 pairs must contain the same bit-pattern
            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);
            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left] = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left] = imax;
            }
        }
        else // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyxxyy;
            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)
                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)
                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else // first stage, sort inside one four
    {
        int4 imask0 = (int4)(0, -1, -1, 0);
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;
        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxxxwwww; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)
        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
        srcRight = srcLeft.s45670123;
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
        mask = pseudomask.xxyyxxyy; // the 0th and 2nd elements must contain the exact same value (as well as 1st and 3rd)
        if((i & 1) ^ dir)
        {
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)
            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)
            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}
Regarding c++ - bitonic sorting of a key/value array, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38571955/