c++ - Cuda Thrust - 如何使用 sort_by_key、merge_by_key 和 reduce_by

c++ - Cuda Thrust - 如何使用 sort_by_key、merge_by_key 和 reduce_by_key 优化代码

转载作者：行者123 更新时间：2023-11-30 03:42:49

我正在使用c++和cuda/thrust在GPU上进行计算，这对我来说是一个新领域。不幸的是，我的代码(下面的 MCVE)效率不高，所以我想知道如何优化它。该代码执行以下操作:

有两个键 vector 和两个值 vector 。关键 vector 基本上包含上三角矩阵的 i 和 j(在本例中:大小为 4x4)。

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 3, 2, 3, 3} value2: {-1, 2.0, -3.5, 2.0, -3.5, -3.5}

任务是对具有相同键的所有值求和。为此，我使用 sort_by_key 对第二个值 vector 进行了排序。结果是:

key1 {0, 0, 0, 1, 1, 2} value1: {0.5, 0.5, 0.5, -1.0, -1.0, 2.0}
key2 {1, 2, 2, 3, 3, 3} value2: {-1.0, 2.0, 2.0, -3.5, -3.5, -3.5}

之后，我使用 merge_by_key 合并了两个值 vector ，这导致了一个新的键和值 vector ，其大小是以前的两倍。

key_merge {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3}
value_merge {0.5, 0.5, 0.5, -1.0, -1.0, -1.0, 2.0, 2.0, 2.0, -3.5, -3.5, -3.5}

最后一步是使用 reduce_by_key 对具有相同键的所有值求和。结果是:

key {0, 1, 2, 3} value: {1.5, -3.0, 6.0, -10.5}

下面执行此操作的代码很慢，恐怕较大尺寸的性能会很差。如何优化？是否可以融合 sort_by_key、merge_by_key 和 reduce_by_key？由于我事先知道 sort_by_key 生成的键 vector ，是否可以将值 vector “从旧键转换为新键”？在减少它们之前合并两个 vector 是否有意义，或者对每对值/键 vector 分别使用 reduce_by_key 是否更快？是否可以利用以下事实来加快 reduce_by_key 的计算速度，即不同键值的数量已知且相等键的数量始终相同？

#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/merge.h>

int main(){
   int key_1[6] = {0, 0, 0, 1, 1, 2};
   int key_2[6] = {1, 2, 3, 2, 3, 3};
   thrust::device_vector<double> k1(key_1,key_1+6);
   thrust::device_vector<double> k2(key_2,key_2+6);

   double value_1[6] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
   double value_2[6] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
   thrust::device_vector<double> v1(value_1,value_1+6);
   thrust::device_vector<double> v2(value_2,value_2+6);

   thrust::device_vector<double> mk(12);
   thrust::device_vector<double> mv(12);
   thrust::device_vector<double> rk(4);
   thrust::device_vector<double> rv(4);

   thrust::sort_by_key(k2.begin(), k2.end(), v2.begin());
   thrust::merge_by_key(k1.begin(), k1.end(), k2.begin(), k2.end(),v1.begin(), v2.begin(), mk.begin(), mv.begin());
   thrust::reduce_by_key(mk.begin(), mk.end(), mv.begin(), rk.begin(), rv.begin());

   for (unsigned i=0; i<4; i++) {
     double tmp1 = rk[i];
     double tmp2 = rv[i];
     printf("key value %f is related to %f\n", tmp1, tmp2);
   }
   return 0;
}

结果:

key value 0.000000 is related to 1.500000
key value 1.000000 is related to -3.000000
key value 2.000000 is related to 6.000000
key value 3.000000 is related to -10.500000

最佳答案

我认为这是一种可能比您的序列更快的方法。关键思想是我们希望避免在我们提前知道顺序的情况下对数据进行排序。如果我们可以利用我们拥有的顺序知识，而不是对数据进行排序，我们可以简单地将其重新排序为所需的排列方式。

让我们对数据进行一些观察。如果您的 key1 和 key2 实际上是上三角矩阵的 i,j 索引，那么我们可以对串联进行一些观察这两个 vector :

串联 vector 将包含相等数量的每个键。 (我相信你可能已经在你的问题中指出了这一点。)所以在你的情况下， vector 将包含三个 0 键，三个 1 键，三个 2 键和三个 3 键。我相信这种模式应该适用于任何独立于矩阵维数的上三角模式。因此，维数为 N 的上三角矩阵在连接的索引 vector 中将有 N 组键，每组由 N-1 个相似的元素组成。
在串联 vector 中，我们可以发现/建立键的一致排序(基于矩阵维数 N)，这允许我们以类似键分组的顺序对 vector 重新排序，而无需诉诸传统排序操作。

如果我们结合以上 2 个想法，那么我们可能可以通过一些分散操作来代替排序/合并事件，然后是 thrust::reduce_by_key 操作来解决整个问题。分散操作可以通过 thrust::copy 到适当的 thrust::permutation_iterator 并结合适当的索引计算仿函数来完成。因为我们确切地知道重新排序的串联 key vector 会是什么样子(在您的 dimension-4 示例中:{0,0,0,1,1,1,2,2,2, 3,3,3})，我们不需要对其显式执行重新排序。但是，我们必须使用相同的映射对 value vector 重新排序。因此，让我们开发该映射的算法:

dimension (N=)4 example

vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11
desired (group) order: 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3
concatenated keys:     0, 0, 0, 1, 1, 2, 1, 2, 3, 2, 3, 3
group start idx:       0, 0, 0, 3, 3, 6, 3, 6, 9, 6, 9, 9
group offset idx:      0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2
destination idx:       0, 1, 2, 3, 4, 6, 5, 7, 9, 8,10,11


dimension (N=)5 example

vector index:          0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
desired (group) order: 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
concatenated keys:     0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 1, 2, 3, 4, 2, 3, 4, 3, 4, 4
group start idx:       0, 0, 0, 0, 4, 4, 4, 8, 8,12, 4, 8,12,16, 8,12,16,12,16,16
group offset idx:      0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 3, 2, 1, 3, 2, 3 
destination idx:       0, 1, 2, 3, 4, 5, 6,10, 7, 8,11,14, 9,12,15,17,13,16,18,19

我们可以观察到，在每种情况下，目标索引(即，对于所需的组顺序，将所选键或值移动到的位置)等于组起始索引加上组偏移索引。组开始索引只是 key 乘以 (N-1)。组偏移索引是一种类似于上三角或下三角索引模式的模式(在 2 个不同的化身中，用于连接 vector 的每一半)。串联键只是 key1 和 key2 vector 的串联(我们将使用 permutation_iterator 虚拟地创建这种串联)。所需的群阶是先验已知的，它只是一个整数群序列，其中 N 个群由 N-1 个元素组成。它等效于串联键 vector 的排序版本。因此，我们可以直接在仿函数中计算目标索引。

为了创建组偏移索引模式，我们可以减去您的两个键 vector (并再减去 1):

key2:                  1, 2, 3, 2, 3, 3
key1:                  0, 0, 0, 1, 1, 2
key1+1:                1, 1, 1, 2, 2, 3
p1 = key2-(key1+1):    0, 1, 2, 0, 1, 0
p2 = (N-2)-p1:         2, 1, 0, 2, 1, 2
grp offset idx=p1|p2:  0, 1, 2, 0, 1, 0, 2, 1, 0, 2, 1, 2

这是一个使用您的示例数据演示上述概念的完整示例:

$ cat t1133.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

// "triangular sort" index generator
struct idx_functor
{
  int n;
  idx_functor(int _n): n(_n) {};
  template <typename T>
  __host__ __device__
  int operator()(const T &t){
    int k1 = thrust::get<0>(t);
    int k2 = thrust::get<1>(t);
    int id = thrust::get<2>(t);
    int go,k;
    if (id < (n*(n-1))/2){ // first half
      go = k2-k1-1;
      k  = k1;
      }
    else { // second half
      go = n-k2+k1-1;
      k  = k2;
      }
    return k*(n-1)+go;
  }
};


const int N = 4;
using namespace thrust::placeholders;

int main(){
   // useful dimensions
   int d1 = N*(N-1);
   int d2 = d1/2;
   // iniitialize keys
   int key_1[] = {0, 0, 0, 1, 1, 2};
   int key_2[] = {1, 2, 3, 2, 3, 3};
   thrust::device_vector<int> k1(key_1, key_1+d2);
   thrust::device_vector<int> k2(key_2, key_2+d2);
   // initialize values
   double value_1[] = {0.5, 0.5, 0.5, -1.0, -1.0, 2.0};
   double value_2[] = {-1, 2.0, -3.5, 2.0, -3.5, -3.5};
   thrust::device_vector<double> v(d1);
   thrust::device_vector<double> vg(d1);
   thrust::copy_n(value_1, d2, v.begin());
   thrust::copy_n(value_2, d2, v.begin()+d2);
   // reorder (group) values by key
   thrust::copy(v.begin(), v.end(), thrust::make_permutation_iterator(vg.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(k1.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::make_permutation_iterator(k2.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%d2)), thrust::counting_iterator<int>(0))), idx_functor(N))));
   // sum results
   thrust::device_vector<double> rv(N);
   thrust::device_vector<int> rk(N);
   thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/(N-1)), thrust::make_transform_iterator(thrust::counting_iterator<int>(d1), _1/(N-1)), vg.begin(), rk.begin(), rv.begin());
   // print results
   std::cout << "Keys:" << std::endl;
   thrust::copy_n(rk.begin(), N, std::ostream_iterator<int>(std::cout, ", "));
   std::cout << std::endl <<  "Sums:" << std::endl;
   thrust::copy_n(rv.begin(), N, std::ostream_iterator<double>(std::cout, ", "));
   std::cout << std::endl;
   return 0;
}
$ nvcc -std=c++11 -o t1133 t1133.cu
$ ./t1133
Keys:
0, 1, 2, 3,
Sums:
1.5, -3, 6, -10.5,
$

最终效果是您的 thrust::sort_by_key 和 thrust::merge_by_key 操作已被单个 thrust::copy 取代操作效率应该更高。

关于c++ - Cuda Thrust - 如何使用 sort_by_key、merge_by_key 和 reduce_by_key 优化代码，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/36603828/

文章推荐： android - 将 createBitmap 从 Java 转换为 JNI

文章推荐： android - 三星的 Zirconia 许可检查器出错

文章推荐： java - 断言在 JUnit 中调用回调/监听器

c - thrust::reduce_by_key 性能，关键重复次数很少
我必须使用许多不同的键对数组进行键控缩减，这些键偶尔会重复一次: keys = {1,2,3,3,4,5,6,7,7, 8, 9, 9,10,11,...} array = {1,2,3,4,5,6
c++ - 推力/cuda reduce_by_key 错误？
我遇到了 thrust 库的 reduce_by_key 函数的问题。对我来说这看起来像是一个错误，但我想在报告之前确定一下。首先，我的设置:CUDA 7.0、Windows 8、NIVIDA Ge
c++ - CUDA 直方图 reduce_by_key 失败
我有以下 CUDA Thrust 代码，它使用 reduce_by_key 将值 [0, 1024) 的直方图绘制到 256 个桶中。我希望每个桶的计数 = 4，但我看到桶 0 有 256 个，桶 2
c++ - CUDA Thrust reduce_by_key 使用更少的内存
我正在尝试减少为我的用例计算 reduce_by_key 所需的内存。与值的数量(大约 1600 万)相比，我有相对较少的唯一键(大约 100-150)。按键减少 example显示分配用于包含结果的
c++ - 推力:reduce_by_key 将 zip_iterator(tuple) 传递给自定义仿函数以按键检索平均值
我想做的是通过 thrust::reduce_by_key 按键获取平均值.我先sort_by_key这对于 reduce_by_key 的连续键分组工作得很好.我用了this帮助我走到这一步。但是，
CUDA Thrust : reduce_by_key on only some values in an array, 基于 "key"数组中的值
假设我有两个 device_vector 数组，d_keys 和 d_data。如果 d_data 例如是一个扁平的 2D 3x5 数组(例如 { 1, 2, 3, 4, 5, 6, 7, 8, 9
c++ - Cuda Thrust - 如何使用 sort_by_key、merge_by_key 和 reduce_by_key 优化代码
我正在使用c++和cuda/thrust在GPU上进行计算，这对我来说是一个新领域。不幸的是，我的代码(下面的 MCVE)效率不高，所以我想知道如何优化它。该代码执行以下操作: 有两个键 vector

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c++ - Cuda Thrust - 如何使用 sort_by_key、merge_by_key 和 reduce_by_key 优化代码