multithreading - Using Thrust with OpenMP: no substantial speed up obtained


Reposted by: 行者123. Updated: 2023-12-03 13:00:13

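I am interested in porting code that I wrote mostly with the Thrust GPU library to multicore CPUs. Thankfully, the website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.

I wrote the simple code below, which sorts a large array, to see the speedup on a machine that supports up to 16 OpenMP threads.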

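The timings obtained on this machine for sorting a random array of 16 million elements are:

STL: 1.47 s
Thrust (16 threads): 1.21 s

There seems to be barely any speedup. I would like to know how to get a substantial speedup when sorting arrays with OpenMP, as I do with GPUs.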

The code is below (file sort.cu). It was compiled as follows:

nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp

The NVCC version is 5.5.
The Thrust library version in use is v1.7.0.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>    
#include <ctime>
#include <time.h>
#include "thrust/sort.h"    

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX; 
     }
  std::cout << "-------------\n";

  clock_t start,stop;
  start=clock();
  std::sort(myarr,myarr+N);
  stop=clock();

  std::cout << "Time taken for sorting the array with STL  is " << (stop-start)/(double)CLOCKS_PER_SEC;

  //--------------------------------------------

  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX; 
      //std::cout << myarr[i] << std::endl;
     }

  start=clock();
  thrust::sort(myarr,myarr+N);
  stop=clock();

  std::cout << "------------------\n";


  std::cout << "Time taken for sorting the array with Thrust  is " << (stop-start)/(double)CLOCKS_PER_SEC;
  return 0;
}

Best answer

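The device backend refers to the behavior of operations performed on a thrust::device_vector or similar reference. Thrust interprets the array/pointer you are passing as a host pointer and performs host-based operations on it, which are not affected by the device backend setting.

There are a variety of ways to fix this. If you read the device backend documentation, you will find general examples and omp-specific examples. I think you could even specify a different host backend, which should give the desired behavior (OMP usage) with your code.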

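Once you fix this, you may get an additional surprise in the results: Thrust appears to sort the array quickly, but the reported execution time is very long. I believe this is because (on Linux, anyway) the clock() function is affected by the number of OMP threads in use: it reports processor time accumulated by the whole process rather than elapsed wall-clock time.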
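The timing effect described above can be checked by taking both measurements around the same call. The following is a minimal sketch, not code from the answer: `Timing` and `measure` are hypothetical helper names, and it assumes a system where clock() reports processor time for the whole process (as it does on Linux). It uses std::chrono::steady_clock for the wall-clock reading instead of gettimeofday.

```cpp
#include <chrono>
#include <ctime>

// Pair of measurements for one call: CPU time as reported by clock()
// and elapsed wall-clock time from a monotonic clock.
// (Hypothetical helper type, not part of Thrust.)
struct Timing {
    double cpu_seconds;
    double wall_seconds;
};

// Runs work() once and returns both measurements.
// (Hypothetical helper function, not part of Thrust.)
template <typename F>
Timing measure(F&& work) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}
```

Wrapping the sort call in a lambda and passing it to measure gives both numbers for the same run. For single-threaded work the two values should be close; when the work runs on several OMP threads, cpu_seconds would be expected to be larger than wall_seconds, by up to roughly the thread count.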

The following code and sample run address those issues and give me a speedup of about 3x with 4 threads.

$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
     }
  std::cout << "-------------\n";

  timeval t1, t2;
  gettimeofday(&t1, NULL);
  std::sort(myarr,myarr+N);
  gettimeofday(&t2, NULL);
  float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "Time taken for sorting the array with STL  is " << et << std::endl;

  //--------------------------------------------

  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
      //std::cout << myarr[i] << std::endl;
     }
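  // Wrap the raw pointer as a device_ptr so that thrust::sort is dispatched
  // to the device backend (OpenMP here); the cast itself copies no data.
  thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);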
  gettimeofday(&t1, NULL);
  thrust::sort(darr,darr+N);
  gettimeofday(&t2, NULL);
  et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "------------------\n";


  std::cout << "Time taken for sorting the array with Thrust  is " << et << std::endl;
  delete[] myarr;  // release the buffer allocated with new[]
  return 0;
}

$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL  is 1.31956
------------------
Time taken for sorting the array with Thrust  is 0.468176
$

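Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads. A number of factors can prevent OMP code from scaling beyond a certain number of threads. Sorting generally tends to be a memory-bound algorithm, so you will probably observe an increase until you have saturated the memory subsystem, and then no further increase from additional cores. Depending on your system, you could be in this situation already, in which case you may not see any improvement from OMP-style multithreading.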
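As a rough reading of the sample run above, the speedup S and the parallel efficiency E on p = 4 threads can be worked out from the two reported times. This treats the std::sort time as the single-threaded baseline, which is an assumption: std::sort and the Thrust OpenMP sort are different implementations, so the figure mixes algorithmic and threading effects.

$$
S = \frac{T_{\text{STL}}}{T_{\text{Thrust}}} = \frac{1.31956}{0.468176} \approx 2.82,
\qquad
E = \frac{S}{p} \approx \frac{2.82}{4} \approx 0.70
$$

An efficiency noticeably below 1 at only 4 threads may be consistent with the memory-bound behavior described above, although a single run is not enough to confirm the cause.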

Regarding "multithreading - Using Thrust with OpenMP: no substantial speed up obtained", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26432462/
