sorting - CUDA Thrust and sort_by_key

Reposted by: 行者123 · Updated: 2023-12-01 04:00:59


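I am looking for a sorting algorithm on CUDA that can sort an array A of elements (double) and return an array B holding the keys (indices) of that array A.
I know about the sort_by_key function in the Thrust library, but I want my element array A to remain unchanged.
What can I do?

My code is: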

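#include <cstdlib>
#include <cstring>
#include <thrust/sort.h>

// P is assumed to be filled with 0, 1, ..., N-1 by the caller.
void sortCUDA(double V[], int P[], int N) {

        // Sort a copy of the values so that V itself stays unchanged.
        double *Vcpy = (double*) malloc(N*sizeof(double));
        memcpy(Vcpy,V,N*sizeof(double));

        // Vcpy and P are raw host pointers, so Thrust dispatches this call
        // to its host backend rather than to the GPU.
        thrust::sort_by_key(Vcpy, Vcpy + N, P);
        free(Vcpy);
}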

I am comparing the Thrust algorithm against my other algorithms, which run sequentially on the CPU:

N               mergesort       sortCUDA
113             0.000008        0.000010
226             0.000018        0.000016
452             0.000036        0.000020
905             0.000061        0.000034
1810            0.000135        0.000071
3621            0.000297        0.000156
7242            0.000917        0.000338
14484           0.001421        0.000853
28968           0.003069        0.001931
57937           0.006666        0.003939
115874          0.014435        0.008025
231749          0.031059        0.016718
463499          0.067407        0.039848
926999          0.148170        0.118003
1853998         0.329005        0.260837
3707996         0.731768        0.544357
7415992         1.638445        1.073755
14831984        3.668039        2.150179
115035495       39.276560       19.812200
230070990       87.750377       39.762915
460141980       200.940501      74.605219

Thrust's performance is decent, but I suspect I could easily get better CPU times if I used OpenMP.

I think this is because of the memcpy.

Solution:

void thrustSort(double V[], int P[], int N)
{
        thrust::device_vector<int> d_P(N);
        thrust::device_vector<double> d_V(V, V + N);
        thrust::sequence(d_P.begin(), d_P.end());

        thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());

        thrust::copy(d_P.begin(),d_P.end(),P);
}

where V is the array of double values I want to sort.
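
Because thrustSort leaves V untouched and only fills P, the sorted order has to be read back through P. A minimal host-side sketch of that step follows; it is plain C++ rather than Thrust, and the helper name gather_sorted is hypothetical (it does not appear in the original post). It assumes P is a valid permutation of 0..N-1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original post): reads V through the
// permutation P, so V itself is never modified.
std::vector<double> gather_sorted(const std::vector<double>& V,
                                  const std::vector<int>& P) {
    assert(V.size() == P.size());          // P must index every element of V
    std::vector<double> out(P.size());
    for (std::size_t k = 0; k < P.size(); ++k) {
        out[k] = V[P[k]];                  // k-th smallest value is V[P[k]]
    }
    return out;
}
```

Calling gather_sorted(V, P) after the sort should yield the values in sorted order while V keeps its original layout. On the device, the same access pattern could be expressed with a gather over d_P instead of a host loop.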

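Best answer

You can change the comparison operator so that the sort reorders the keys (indices) by comparing the values they refer to, instead of moving the values themselves. @Robert Crovella correctly pointed out that a raw device pointer cannot be assigned from the host. The revised algorithm is: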

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Compares two indices by the values they refer to in A.
struct cmp
{
  cmp(const double *ptr) : rawA(ptr) { }

  // '>' sorts in descending order of A; use '<' for ascending order.
  __host__ __device__ bool operator()(const int i, const int j) const
  { return rawA[i] > rawA[j]; }

  const double *rawA; // an array in global memory (device pointer)
};

void sortkeys(double *A, int n) {
  // move data to the GPU
  thrust::device_vector<double> devA(A, A + n);
  double *rawA = thrust::raw_pointer_cast(devA.data());

  thrust::device_vector<int> B(n);
  // initialize keys to 0, 1, ..., n-1
  thrust::sequence(B.begin(), B.end());
  thrust::sort(B.begin(), B.end(), cmp(rawA));
  // B now contains the sorted keys; devA is unchanged
}
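
To check the GPU result, the same index-sorting idea can be written on the host with the standard library. The sketch below swaps thrust::sort for std::sort and uses the same predicate as cmp; the function name argsort_descending is hypothetical and not part of the original answer.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical host-side reference (not from the original answer): sorts
// indices by the values they refer to, leaving the input array A untouched.
std::vector<int> argsort_descending(const double *A, int n) {
    assert(n >= 0);
    std::vector<int> B(n);
    std::iota(B.begin(), B.end(), 0);   // B = 0, 1, ..., n-1
    // Same predicate as cmp: index i comes before j when A[i] > A[j].
    std::sort(B.begin(), B.end(),
              [A](int i, int j) { return A[i] > A[j]; });
    return B;
}
```

Comparing this output with the contents of B copied back from the device is a simple sanity check. For inputs with equal values the two orders may differ, since neither sort is stable.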

Here is an alternative using ArrayFire. I am not sure which of the two is more efficient, though, since the ArrayFire solution uses two additional arrays:

void sortkeys(double *A, int n) {
   af::array devA(n, A, af::afHost);
   af::array vals, indices;
   // sort and populate vals/indices arrays
   af::sort(vals, indices, devA);
   std::cout << devA << "\n" << indices << "\n";
}

This post on "sorting - CUDA Thrust and sort_by_key" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/13515308/
