c++ - Why is PPL so much slower than a sequential loop and OpenMP in this case?


Reposted by: 太空狗 | Updated: 2023-10-29 20:59:27


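Following up on my question on CodeReview, I would like to know why a PPL implementation of a simple transform of two vectors using std::plus<int> is so much slower than a sequential std::transform and a for loop under OpenMP (sequential (with vectorization): 25 ms, sequential (without vectorization): 28 ms, C++ AMP: 131 ms, PPL: 51 ms, OpenMP: 24 ms).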

I profiled with the following code, compiled with full optimization in Visual Studio 2013:

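#include <amp.h>
#include <ppl.h>        // concurrency::parallel_transform
#include <algorithm>    // std::transform
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
#include <assert.h>
#include <functional>
#include <chrono>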

using namespace concurrency;

const std::size_t size = 30737418;

//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );

    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );

    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }

    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();

    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );

    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

#pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();

    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );

    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

#pragma omp parallel
#pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}

Best answer

According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner. MSDN says of it:

This method of partitioning employs range stealing for load balancing as well as per-iterate cancellation.

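Using this partitioner is overkill for a simple (and memory-bound) operation such as summing two arrays, because its overhead is substantial. You should use concurrency::static_partitioner instead. Static partitioning is exactly what most OpenMP implementations use by default when the schedule clause is missing from the for construct.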
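As a minimal sketch of the suggested change, the partitioner can be passed as the last argument of concurrency::parallel_transform. The helper name add_static is hypothetical and not part of the question's code, and the non-MSVC branch is only a sequential fallback so that the snippet compiles where PPL is unavailable:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>  // PPL ships only with Visual C++
#endif

// Element-wise sum of two equally sized vectors.
// add_static is a hypothetical helper name used for illustration only.
std::vector<int> add_static(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> result(a.size());
#ifdef _MSC_VER
    // Passing static_partitioner replaces the default auto_partitioner, so the
    // range is split into fixed chunks up front, with no range stealing.
    concurrency::parallel_transform(a.begin(), a.end(), b.begin(), result.begin(),
                                    std::plus<int>(),
                                    concurrency::static_partitioner());
#else
    // Assumption: outside Visual C++ there is no PPL, so fall back to the
    // sequential algorithm to keep the sketch compilable.
    std::transform(a.begin(), a.end(), b.begin(), result.begin(), std::plus<int>());
#endif
    return result;
}
```

In the benchmark above, the equivalent change would be to add concurrency::static_partitioner() as a sixth argument to the existing parallel_transform call. Whether that brings PPL down to the OpenMP timing on a given machine would have to be measured.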

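As already noted on Code Review, this code is heavily memory-bound. It is also the SUM kernel of the STREAM benchmark, which is designed specifically to measure the memory bandwidth of the system it runs on. The a[i] = b[i] + c[i] operation has a very low operational intensity (measured in OPS/byte), so its speed is determined entirely by the bandwidth of the main memory bus. That is why the OpenMP code and the vectorized serial code deliver essentially the same performance, which is not much higher than that of the non-vectorized serial code.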
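To put numbers on the memory-bound argument, the following sketch estimates the operational intensity and the effective bandwidth implied by a measured time. The function names are hypothetical, and the byte count assumes two loads and one store per element with write-allocate traffic ignored:

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per element by a[i] = b[i] + c[i]: two loads plus one store.
// Assumption: extra write-allocate traffic is not counted.
constexpr std::size_t bytes_per_element(std::size_t elem_size)
{
    return 3 * elem_size;
}

// Operational intensity in operations per byte: one addition per element.
constexpr double operational_intensity(std::size_t elem_size)
{
    return 1.0 / static_cast<double>(bytes_per_element(elem_size));
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) when n elements of
// elem_size bytes each are processed in `ms` milliseconds.
inline double effective_bandwidth_gbs(std::size_t n, std::size_t elem_size, double ms)
{
    assert(ms > 0.0);
    const double bytes = static_cast<double>(n) *
                         static_cast<double>(bytes_per_element(elem_size));
    return bytes / (ms * 1e-3) / 1e9;
}
```

With the question's 30737418 elements, 4-byte int, and the 25 ms vectorized run, effective_bandwidth_gbs(30737418, 4, 25.0) comes to roughly 14.75 GB/s at an intensity of 1/12 operation per byte. The 28 ms non-vectorized run gives about 13.17 GB/s. Both figures are the same order as what a single memory controller of that era could plausibly sustain, which is consistent with the loop being limited by memory traffic rather than by arithmetic.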

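The way to get higher parallel performance is to run the code on a modern multi-socket system and to have the data of each array evenly distributed across the sockets. You can then get a speedup almost equal to the number of CPU sockets.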
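The answer does not say how to distribute the data. One common approach, sketched here as an assumption, is parallel "first touch" initialization. It relies on the OS placing each page on the NUMA node of the thread that first writes to it, and on threads staying pinned to their socket. The helper name first_touch_sum is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: initialize and sum three arrays using the same static
// schedule in both loops, so each thread later reads the pages it touched first.
// Build with OpenMP enabled (/openmp or -fopenmp); without it the pragmas are
// ignored and the code simply runs serially.
std::unique_ptr<int[]> first_touch_sum(std::ptrdiff_t n)
{
    assert(n >= 0);
    // new int[n] leaves the memory uninitialized, so large allocations are
    // typically not touched yet. std::vector<int>(n) would zero-fill from the
    // calling thread and defeat the placement.
    std::unique_ptr<int[]> a(new int[n]);
    std::unique_ptr<int[]> b(new int[n]);
    std::unique_ptr<int[]> c(new int[n]);

    // First touch: each thread writes its own static chunk of every array.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        a[i] = 0;
        b[i] = static_cast<int>(i % 1000);
        c[i] = 2 * static_cast<int>(i % 1000);
    }

    // Same schedule, so each thread works on the chunk it initialized.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
    return a;
}
```

The key design point is that the initialization loop and the compute loop share schedule(static), so the mapping from index ranges to threads is identical in both. Actual page placement depends on the operating system and on thread affinity settings, neither of which this sketch controls.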

Regarding "c++ - Why is PPL so much slower than a sequential loop and OpenMP in this case?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24594454/
