
c++ - Why is PPL significantly slower than a sequential loop and OpenMP in this case

Reposted. Author: 太空狗. Updated: 2023-10-29 20:59:27

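Further to my question on Code Review, I would like to know why the PPL implementation of a simple transform of two vectors using std::plus<int> is so much slower than both the sequential std::transform and a for loop parallelized with OpenMP (sequential with vectorization: 25 ms, sequential without vectorization: 28 ms, C++ AMP: 131 ms, PPL: 51 ms, OpenMP: 24 ms).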

I used the following code for profiling, compiled with full optimization in Visual Studio 2013:

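#include <amp.h>
#include <ppl.h>        // concurrency::parallel_transform
#include <algorithm>    // std::transform
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
#include <assert.h>
#include <functional>
#include <chrono>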

using namespace concurrency;

const std::size_t size = 30737418;

//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );

    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );

    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }

    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();

    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );

    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    #pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();

    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );

    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    #pragma omp parallel
    #pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}

Best answer

According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner. MSDN describes it as follows:

This method of partitioning employes range stealing for load balancing as well as per-iterate cancellation.

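Using this partitioner is overkill for a simple (and memory-bound) operation such as summing two arrays, because its overhead is significant. You should use concurrency::static_partitioner instead. Static partitioning is exactly what most OpenMP implementations use by default when the schedule clause is missing from the for construct.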
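A minimal sketch of that suggestion, assuming the six-argument overload of concurrency::parallel_transform that takes a partitioner as its last argument. The helper name add_static is hypothetical and not part of the original post. PPL ships only with the Microsoft toolchain, so the sketch falls back to the sequential std::transform elsewhere purely so that it still builds.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>  // PPL is available only with the Microsoft toolchain
#endif

// Hypothetical helper (not from the original post): element-wise sum of two
// equally sized vectors.
std::vector<int> add_static( const std::vector<int>& a, const std::vector<int>& b )
{
    assert( a.size() == b.size() );
    std::vector<int> result( a.size() );
#ifdef _MSC_VER
    // Pass static_partitioner explicitly instead of relying on the default
    // auto_partitioner: the range is split into fixed chunks up front, with
    // no range stealing.
    concurrency::parallel_transform( a.begin(), a.end(), b.begin(), result.begin(),
                                     std::plus<int>(), concurrency::static_partitioner() );
#else
    // Fallback (an assumption of this sketch): plain sequential transform.
    std::transform( a.begin(), a.end(), b.begin(), result.begin(), std::plus<int>() );
#endif
    return result;
}
```

In the benchmark above, the equivalent change would be to append concurrency::static_partitioner() as the last argument of the existing parallel_transform call. Whether this closes the gap to OpenMP on a given machine has to be measured.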

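As already mentioned on Code Review, this code is heavily memory-bound. It is also the SUM kernel of the STREAM benchmark, which is designed specifically to measure the memory bandwidth of the system it runs on. The a[i] = b[i] + c[i] operation has very low operational intensity (measured in OPS/byte), and its speed is determined entirely by the bandwidth of the main memory bus. That is why the OpenMP code and the vectorized serial code deliver essentially the same performance, which is not much higher than that of the non-vectorized serial code.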
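To make the operational-intensity argument concrete, here is a rough back-of-the-envelope sketch. It assumes 4-byte int elements and counts only two loads and one store per element, ignoring any extra traffic the hardware may generate (for example, reading the destination cache line before writing it). The function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: bytes moved per element for a[i] = b[i] + c[i],
// counting two loads and one store of elem_size bytes each.
constexpr double bytes_per_element( std::size_t elem_size )
{
    return 3.0 * static_cast<double>( elem_size );
}

// Hypothetical helper: effective memory bandwidth in GB/s (1 GB = 1e9 bytes)
// for n elements processed in elapsed_ms milliseconds.
double effective_bandwidth_gb_s( std::size_t n, std::size_t elem_size, double elapsed_ms )
{
    const double bytes = static_cast<double>( n ) * bytes_per_element( elem_size );
    return bytes / ( elapsed_ms * 1e-3 ) / 1e9;
}
```

Under these assumptions the kernel performs one addition per 12 bytes, that is about 0.083 OPS/byte. With the figures from the question (30737418 elements), the 24 ms OpenMP run corresponds to roughly 15.4 GB/s and the 25 ms vectorized serial run to roughly 14.8 GB/s, which is consistent with both being limited by the same memory bus.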

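The way to obtain higher parallel performance is to run the code on a modern multi-socket system and to have the data of each array distributed evenly among the sockets. You can then get a speedup almost equal to the number of CPU sockets.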
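One common way to get such a distribution is first-touch placement, sketched below. This is an illustration and not something the original answer spells out. It assumes the operating system places a memory page on the NUMA node of the thread that first writes to it, and that the OpenMP threads stay bound to their sockets, which has to be configured outside this code. The helper names are hypothetical. The arrays are allocated without being value-initialized, because a std::vector constructor would touch every page from a single thread.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical helper: allocate n ints without writing to them, then
// initialize them in parallel so each thread is the first to touch its chunk.
std::unique_ptr<int[]> make_first_touch( std::size_t n, int value )
{
    std::unique_ptr<int[]> data( new int[n] );  // default-initialized: not written yet
    int* p = data.get();
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>( n );
    #pragma omp parallel for schedule(static)
    for( std::ptrdiff_t i = 0; i < count; ++i )
        p[i] = value;  // first write to this element's page
    return data;
}

// Same static schedule as the initialization, so each thread works on the
// chunk it touched first.
void add_arrays( const int* a, const int* b, int* out, std::size_t n )
{
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>( n );
    #pragma omp parallel for schedule(static)
    for( std::ptrdiff_t i = 0; i < count; ++i )
        out[i] = a[i] + b[i];
}
```

Without OpenMP enabled the pragmas are ignored and the code runs sequentially with the same result. On a single-socket machine the placement makes no difference.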

Regarding "c++ - Why is PPL significantly slower than a sequential loop and OpenMP in this case", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24594454/
