c++ - Bitonic sort of a key/value array

Reposted by: 搜寻专家. Updated: 2023-10-31 01:34:48

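I am trying to modify Intel's Bitonic Sorting sample, which sorts an array of cl_int, so that it sorts an array of cl_int2 by key (i.e. by cl_int2.x).

The Intel sample consists of simple host code and one OpenCL kernel that is invoked many times during a single sort operation (multi-pass). The kernel loads 4 array items at a time as a cl_int4 and operates on them.

I did not change the host-side algorithm, only the device code. The changes to the kernel function are: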

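Changed the type of the kernel's first parameter from int4* to int8* (to load four key-value pairs)
Only the .even components of the theArray elements are used in the comparison ( < )
Created a "pseudomask" ( int4 ) and derived mask from it as pseudomask.xxyyzzww (so that the values are covered as well)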
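The three changes above amount to comparing keys only and then widening each comparison result over its whole key-value pair. As a rough illustration, here is a plain C emulation of the exchange between two elements (the passOfStage > 0 branch of the kernel shown further down). The Int8 struct and the exchange function are hypothetical stand-ins for the OpenCL int8 type and the inline kernel code; they are not part of the Intel sample.

```c
#include <assert.h>

/* Hypothetical stand-in for an OpenCL int8 holding four (key, value) pairs:
 * k0 v0 k1 v1 k2 v2 k3 v3. */
typedef struct { int s[8]; } Int8;

/* Emulates, for one pair of elements:
 *   pseudomask = left.even < right.even;
 *   mask       = pseudomask.xxyyzzww;
 *   imin = (left & mask)  | (right & ~mask);
 *   imax = (left & ~mask) | (right & mask);
 */
void exchange(const Int8 *left, const Int8 *right, Int8 *imin, Int8 *imax)
{
    /* The outputs must not alias the inputs, otherwise inputs are overwritten
     * while they are still being read. */
    assert(imin != left && imin != right && imax != left && imax != right);

    for (int pair = 0; pair < 4; ++pair) {
        /* OpenCL vector comparisons yield -1 (all bits set) for true, 0 for false. */
        int lane = (left->s[2 * pair] < right->s[2 * pair]) ? -1 : 0;

        /* The same lane is applied to the key and to the value (the "xx" in xxyyzzww). */
        for (int c = 2 * pair; c < 2 * pair + 2; ++c) {
            imin->s[c] = (left->s[c] & lane)  | (right->s[c] & ~lane);
            imax->s[c] = (left->s[c] & ~lane) | (right->s[c] & lane);
        }
    }
}
```

Because imin and imax use complementary masks, every pair from left and right ends up in exactly one of the two outputs, even when two keys are equal. This branch exchanges pairs between two different elements, which is why ties do no harm here.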

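Although the output of my modified kernel is a cl_int2 array that is sorted perfectly by the first component ( cl_int2.x ), the values ( cl_int2.y ) are wrong: the value of one item is repeated across the next 4 or 8 items, then a new value is picked up and repeated, and so on.

I am sure there is a trivial mistake, but I cannot find it.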

Diff of the original Intel code and my modified version.

EDIT: The `cl_int2` array is sorted perfectly when every key ( `cl_int2.x` ) is unique.

Sample input: http://pastebin.com/92qB1csT

Sample output: http://pastebin.com/dsU97Npn

(Correctly sorted array: http://pastebin.com/Nb56BuQK)

Modified kernel code (commented):

// Copyright (c) 2009-2011 Intel Corporation
// https://software.intel.com/en-us/articles/bitonic-sorting

// Modified to sort int2 key-value array

__kernel void BitonicSort(__global int8* theArray,
                         const uint stage,
                         const uint passOfStage,
                         const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0,  0, -1, -1);
    int4 imask11 = (int4)(0, -1,  0, -1);

    if(stage > 0)
    {
        if(passOfStage > 0)    // upper level pass, exchange between two fours,
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;

            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww;

            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);

            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left]  = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left]  = imax;
            }
        }
        else    // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyzzww;

            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;

                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;

                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else    // first stage, sort inside one four
    {
        /*
         *  To convert this code to int2 sorter, do this:
         *      1. instead of loading int4, load int8 (key,value, key,value, ...)
         *      2. when there is a vector swizzling, replace component index with two consecutive indices:
         *           srcLeft.yxwz  ->  srcLeft.s23016745
         *         use this rewrite rule:
         *           x  y  z  w
         *           01 23 45 67
         *      3. replace comparison operands with only their keys swizzled:
         *           mask = srcLeft < srcRight;    ->    pseudomask = srcLeft.even < srcRight.even; mask = pseudomask.xxyyzzww;
         */

        //  make bitonic sequence out of 4.
        int4 imask0 = (int4)(0, -1, -1,  0); // -1 in comparison = true (all bits set - two's complement)
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;

        /*
         * This XOR mask flips bits, so that in `mask` are the following
         * results (remember that srcRight is srcLeft with swapped component pairs):
         *
         *      [ left.x<left.y, left.x<left.y,    left.w<left.z, left.w<left.z  ]
         *  or: [ left.x<left.y, left.x<left.y,    left.z>left.w, left.z>left.w  ]
         */
        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxyyzzww;

        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);  // make sure the numbers are sorted like this:
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);

        /*
         *  Now the pairs of numbers in `srcLeft` are sorted according to the specified `dir`ection.
         *  If dir == true, then
         *    The components `x` and `y` are swapped so that `x` < `y`. Moreover `z` and `w` are swapped so that `z` > `w`. This resembles up-hill: /\
         *  else
         *    The components `x` and `y` are swapped so that `x` > `y`. Moreover `z` and `w` are swapped so that `z` < `w`. This resembles down-hill: \/
         *
         *  This swapping is achieved by creating `srcLeft`, which is in normal order, and `srcRight`, which has component pairs switched (xyzw -> yxwz).
         *  Then the `mask` is created. The mask bits are redundant because it applies to vector component pairs (so in order to implement key-value sorting,
         *  I have to increase the length of masks!).
         *
         *  The non-ordered component pairs in `srcLeft` are masked out by `mask` while the inverted `mask` is applied to the (pair-wise switched) `srcRight`.
         *
         *  This (the previous) first flipping just makes a 4-bitonic sequence.
         */


        /*
         *  This second step just sorts the bitonic sequence
         */
        srcRight = srcLeft.s45670123; // inverts the bitonic sequence

        // [ left.a<left.c, left.b<left.d,    left.a<left.c, left.b<left.d ]
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10;  // imask10 = (noflip, noflip,  flip, flip)
        mask = pseudomask.xxyyzzww;

        // even or odd work-item: each work-item outputs a monotonic sequence, and the direction alternates so that neighbouring outputs form a bitonic sequence for the next pass.
        if((i & 1) ^ dir)
        {
            // this sorts the bitonic sequence, hence splitting it
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;

            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;

            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}

Host-side code:

void ExecuteSortKernel(cl_kernel kernel, cl_command_queue queue, cl_mem cl_input_buffer, cl_int arraySize, cl_uint sortAscending)
{
    cl_int numStages = 0;

    cl_int stage;
    cl_int passOfStage;

    for (cl_int temp = arraySize; temp > 2; temp >>= 1)
        numStages++;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &cl_input_buffer);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), (void *) &sortAscending);

    for (stage = 0; stage < numStages; stage++) {
        clSetKernelArg(kernel, 1, sizeof(cl_uint), (void *) &stage);

        for (passOfStage = stage; passOfStage >= 0; passOfStage--) {
            clSetKernelArg(kernel, 2, sizeof(cl_uint), (void *) &passOfStage);

            // set work-item dimensions
            size_t gsz = arraySize / (2*4);
            size_t global_work_size[1] = { passOfStage ? gsz : gsz << 1 };    //number of quad items in input array

            // execute kernel
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
        }
    }
}
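To make the launch pattern of ExecuteSortKernel easier to follow, the sketch below mirrors its loops in plain C without calling OpenCL. The function names are hypothetical, and the precondition (arraySize is a power of two and at least 8) is an assumption inferred from the indexing, not something the host code checks.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop as in ExecuteSortKernel: roughly log2(arraySize) - 1 stages. */
int count_stages(int arraySize)
{
    int numStages = 0;
    for (int temp = arraySize; temp > 2; temp >>= 1)
        numStages++;
    return numStages;
}

/* Number of clEnqueueNDRangeKernel calls the host loop would issue. */
int count_kernel_launches(int arraySize)
{
    /* Assumption: power of two, and large enough that gsz below is non-zero. */
    assert(arraySize >= 8 && (arraySize & (arraySize - 1)) == 0);

    int launches = 0;
    int numStages = count_stages(arraySize);
    for (int stage = 0; stage < numStages; stage++)
        for (int passOfStage = stage; passOfStage >= 0; passOfStage--)
            launches++;
    return launches;
}

/* Work-items per launch: one per pair of elements when passOfStage > 0,
 * one per element (four items) when passOfStage == 0. */
size_t global_work_size(int arraySize, int passOfStage)
{
    size_t gsz = (size_t)arraySize / (2 * 4);
    return passOfStage ? gsz : gsz << 1;
}
```

For example, count_kernel_launches(32) gives 10 launches (1 + 2 + 3 + 4 over four stages), with 8 work-items in the passOfStage == 0 launches and 4 in all the others.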

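Best answer

I finally solved the problem!

The tricky part is the way the original Intel code handles equal values in adjacent pairs within the loaded 4-tuple: it does not handle them explicitly!

The bug is in the very first stage and in the last passOfStage (i.e. passOfStage = 0) of every other stage. These parts of the code swap individual 2-tuples within one 4-tuple (represented by one cl_int8 element of theArray).

Take this excerpt as an example (it does not work correctly for equal adjacent 2-tuples within a 4-tuple):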

imask0     = (int4)(0, -1, -1,  0);
srcLeft    = theArray[i];  // int8
srcRight   = srcLeft.s23016745;
pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
mask       = pseudomask.xxyyzzww;
result     = (srcLeft & mask) | (srcRight & ~mask);

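Imagine what happens when we run this (unfixed) code with srcLeft.even = (int4)(7,7, 5,5). The comparison srcLeft.even < srcRight.even yields (int4)(0,0,0,0), and after XORing this result with imask0 we get pseudomask = (int4)(0,-1,-1,0), i.e. imask0 itself. That, however, is wrong: with mask = pseudomask.xxyyzzww the first two output slots both receive the second key-value pair and the last two both receive the third one, so two pairs are duplicated and two are lost.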
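The failure can be reproduced on the CPU. The following is a minimal C emulation of the excerpt above for the dir != 0 branch. Int8 and unfixed_first_step are hypothetical names introduced only for this illustration, and the values 10..13 are made-up payloads.

```c
#include <assert.h>

/* Hypothetical stand-in for an OpenCL int8: k0 v0 k1 v1 k2 v2 k3 v3. */
typedef struct { int s[8]; } Int8;

/* Emulates:
 *   srcRight   = srcLeft.s23016745;
 *   pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
 *   mask       = pseudomask.xxyyzzww;
 *   result     = (srcLeft & mask) | (srcRight & ~mask);
 */
Int8 unfixed_first_step(Int8 srcLeft)
{
    static const int swap_pairs[8] = {2, 3, 0, 1, 6, 7, 4, 5};
    static const int imask0[4] = {0, -1, -1, 0};
    Int8 srcRight, result;

    for (int c = 0; c < 8; ++c)
        srcRight.s[c] = srcLeft.s[swap_pairs[c]];

    for (int pair = 0; pair < 4; ++pair) {
        /* One pseudomask lane per pair, computed from the keys only. */
        int lane = ((srcLeft.s[2 * pair] < srcRight.s[2 * pair]) ? -1 : 0) ^ imask0[pair];
        assert(lane == 0 || lane == -1);   /* all bits clear or all bits set */

        /* xxyyzzww: each pair uses its own lane for key and value. */
        for (int c = 2 * pair; c < 2 * pair + 2; ++c)
            result.s[c] = (srcLeft.s[c] & lane) | (srcRight.s[c] & ~lane);
    }
    return result;
}
```

Feeding it the pairs (7,10) (7,11) (5,12) (5,13) returns (7,11) (7,11) (5,12) (5,12): the keys still look sorted, but the values 10 and 13 have disappeared, which matches the repeated values described in the question.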

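The pseudomask needs to follow the pattern (int4)(a,a, b,b), where a and b can each be 0 or -1. This means that the following comparison is enough to form a correct mask: quasimask = srcLeft.s06 < srcRight.s06. The correct mask is then created as mask = quasimask.xxxxyyyy. The first two x's cover the first key-value pair of the 4-tuple (4-tuple = one element of theArray). Because both pairs of an adjacent couple must be masked with the same bits (the couple that imask0 marks as a 0 / -1 pair), we add another xx for the second pair. The second couple of pairs in the 4-tuple is masked analogously, which gives yyyy.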
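A small C emulation of this corrected first step, for the dir != 0 branch, might look as follows. It takes one decision per couple of adjacent pairs (pseudomask lanes x and w, expanded as xxxxwwww, which is what the fixed kernel uses with imask0). Int8 and fixed_first_step are hypothetical names, not part of the original code.

```c
#include <assert.h>

/* Hypothetical stand-in for an OpenCL int8: k0 v0 k1 v1 k2 v2 k3 v3. */
typedef struct { int s[8]; } Int8;

/* Emulates:
 *   srcRight   = srcLeft.s23016745;
 *   pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
 *   mask       = pseudomask.xxxxwwww;
 *   result     = (srcLeft & mask) | (srcRight & ~mask);
 */
Int8 fixed_first_step(Int8 srcLeft)
{
    static const int swap_pairs[8] = {2, 3, 0, 1, 6, 7, 4, 5};
    static const int imask0[4] = {0, -1, -1, 0};
    static const int xxxxwwww[8] = {0, 0, 0, 0, 3, 3, 3, 3};
    Int8 srcRight, result;
    int pseudomask[4];

    for (int c = 0; c < 8; ++c)
        srcRight.s[c] = srcLeft.s[swap_pairs[c]];

    for (int pair = 0; pair < 4; ++pair)
        pseudomask[pair] = ((srcLeft.s[2 * pair] < srcRight.s[2 * pair]) ? -1 : 0) ^ imask0[pair];

    for (int c = 0; c < 8; ++c) {
        /* Both pairs of a couple share one lane, so they are either both kept
         * or both swapped; nothing can be duplicated. */
        int lane = pseudomask[xxxxwwww[c]];
        assert(lane == 0 || lane == -1);
        result.s[c] = (srcLeft.s[c] & lane) | (srcRight.s[c] & ~lane);
    }
    return result;
}
```

With the input (7,10) (7,11) (5,12) (5,13) this returns (7,11) (7,10) (5,13) (5,12): the equal-key pairs are swapped, but all four values survive.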

Visual example of the bit-masking with imask11

srcLeft:                        x  y  z  w
                                <  <  <  <
srcRight [relative to srcLeft]: y  x  w  z
^ imask11:                      0 -1  0 -1
------------------------------------------
(srcLeft<srcRight)^imask11:     x  x  z  z
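The diagram holds only when the two keys of an adjacent couple differ. The short C check below, a sketch with hypothetical helper names, computes lanes x and y of (srcLeft.even < srcRight.even) ^ imask11 for a single couple with keys a and b and counts how often they disagree.

```c
#include <assert.h>

/* Lane x: compares a (left) with b (right), XORed with imask11.x = 0. */
int lane_x(int a, int b) { return ((a < b) ? -1 : 0) ^ 0; }

/* Lane y: compares b (left) with a (right), XORed with imask11.y = -1. */
int lane_y(int a, int b) { return ((b < a) ? -1 : 0) ^ -1; }

/* Counts the key combinations in [0, n) x [0, n) where the two lanes differ. */
int count_disagreements(int n)
{
    assert(n > 0);
    int count = 0;
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b)
            if (lane_x(a, b) != lane_y(a, b))
                ++count;
    return count;
}
```

Lane x is effectively a < b and lane y is a <= b, so they differ exactly when a == b. That is why expanding the pseudomask as xxxxzzzz (reusing lane x for the whole couple) is safe, while xxyyzzww breaks on ties.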

The fixed, fully working version (I have commented the fixed parts):

__kernel void BitonicSort(__global int8* theArray,
                         const uint stage,
                         const uint passOfStage,
                         const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0,  0, -1, -1);
    int4 imask11 = (int4)(0, -1,  0, -1);

    if(stage > 0)
    {
        if(passOfStage > 0)    // upper level pass, exchange between two fours
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;

            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww; // whole pairs are exchanged between two elements here; no imask is applied, so no two pairs need to share the same bit pattern

            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);

            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left]  = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left]  = imax;
            }
        }
        else    // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyxxyy;

            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)

                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)

                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else    // first stage, sort inside one four
    {
        int4 imask0 = (int4)(0, -1, -1,  0);
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;

        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxxxwwww; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)

        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);


        srcRight = srcLeft.s45670123;
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
        mask = pseudomask.xxyyxxyy; // the 0th and 2nd elements must contain the exact same value (as well as 1st and 3rd)

        if((i & 1) ^ dir)
        {
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)

            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxxxzzzz; // the 0th and 1st elements must contain the exact same value (as well as 2nd and 3rd)

            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}

The original question, "c++ - Bitonic sort of a key/value array", can be found on Stack Overflow: https://stackoverflow.com/questions/38571955/
