
c++ - OpenMP parallelization


I have the following method, called pgain, which calls the method dist that I am trying to parallelize:

/******************************************************************************/

/* For a given point x, find the cost of the following operation:
* -- open a facility at x if there isn't already one there,
* -- for points y such that the assignment distance of y exceeds dist(y, x),
* make y a member of x,
* -- for facilities y such that reassigning y and all its members to x
* would save cost, realize this closing and reassignment.
*
* If the cost of this operation is negative (i.e., if this entire operation
* saves cost), perform this operation and return the amount of cost saved;
* otherwise, do nothing.
*/

/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
    int i;
    int number_of_centers_to_close = 0;

    static double *work_mem;
    static double gl_cost_of_opening_x;
    static int gl_number_of_centers_to_close;

    int stride = *numcenters + 2;
    //make stride a multiple of CACHE_LINE
    int cl = CACHE_LINE/sizeof ( double );
    if ( stride % cl != 0 ) {
        stride = cl * ( stride / cl + 1 );
    }
    int K = stride - 2 ; // K==*numcenters

    //my own cost of opening x
    double cost_of_opening_x = 0;

    work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
    gl_cost_of_opening_x = 0;
    gl_number_of_centers_to_close = 0;

    /*
     * For each center, we have a *lower* field that indicates
     * how much we will save by closing the center.
     */
    int count = 0;
    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            center_table[i] = count++;
        }
    }
    work_mem[0] = 0;

    //now we finish building the table. clear the working memory.
    memset ( switch_membership, 0, points->num * sizeof ( bool ) );
    memset ( work_mem, 0, stride * sizeof ( double ) );
    memset ( work_mem + stride, 0, stride * sizeof ( double ) );

    //my *lower* fields
    double* lower = &work_mem[0];
    //global *lower* fields
    double* gl_lower = &work_mem[stride];

    #pragma omp parallel for
    for ( i = 0; i < points->num; i++ ) {
        float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
        float current_cost = points->p[i].cost;

        if ( x_cost < current_cost ) {

            // point i would save cost just by switching to x
            // (note that i cannot be a median,
            // or else dist(p[i], p[x]) would be 0)

            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;

        } else {

            // cost of assigning i to x is at least current assignment cost of i

            // consider the savings that i's **current** median would realize
            // if we reassigned that median and all its members to x;
            // note we've already accounted for the fact that the median
            // would save z by closing; now we have to subtract from the savings
            // the extra cost of reassigning that median and its members
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
    }

    // at this time, we can calculate the cost of opening a center
    // at x; if it is negative, we'll go through with opening it

    for ( int i = 0; i < points->num; i++ ) {
        if ( is_center[i] ) {
            double low = z + work_mem[center_table[i]];
            gl_lower[center_table[i]] = low;
            if ( low > 0 ) {
                // i is a median, and
                // if we were to open x (which we still may not) we'd close i

                // note, we'll ignore the following quantity unless we do open x
                ++number_of_centers_to_close;
                cost_of_opening_x -= low;
            }
        }
    }
    //use the rest of working memory to store the following
    work_mem[K] = number_of_centers_to_close;
    work_mem[K+1] = cost_of_opening_x;

    gl_number_of_centers_to_close = ( int ) work_mem[K];
    gl_cost_of_opening_x = z + work_mem[K+1];

    // Now, check whether opening x would save cost; if so, do it, and
    // otherwise do nothing

    if ( gl_cost_of_opening_x < 0 ) {
        // we'd save money by opening x; we'll do it
        for ( int i = 0; i < points->num; i++ ) {
            bool close_center = gl_lower[center_table[points->p[i].assign]] > 0 ;
            if ( switch_membership[i] || close_center ) {
                // Either i's median (which may be i itself) is closing,
                // or i is closer to x than to its current median
                points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
                points->p[i].assign = x;
            }
        }
        for ( int i = 0; i < points->num; i++ ) {
            if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
                is_center[i] = false;
            }
        }
        if ( x >= 0 && x < points->num ) {
            is_center[x] = true;
        }

        *numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
    } else {
        gl_cost_of_opening_x = 0; // the value we'll return
    }

    free ( work_mem );

    return -gl_cost_of_opening_x;
}

The function I am trying to parallelize:

/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
    float result = 0.0;
    #pragma omp parallel for reduction(+:result)
    for ( int i = 0; i < dim; i++ ) {
        result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
    }
    return ( result );
}

And the Point structure is:

/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
    float weight;
    float *coord;
    long assign;  /* number of point where this one is assigned */
    float cost;   /* cost of that assignment, weight*distance */
} Point;

I have a large streamcluster application (815 lines of code) that generates real-time numbers and sorts them in a particular way. I used the Scalasca tool on Linux to measure which methods take up most of the time, and I found that the method dist listed above is the most time-consuming one. I am trying to use OpenMP, but the parallelized code runs longer than the serial version: where the serial code runs in 1.5 seconds, the parallelized version takes 20, although the results are identical. I am wondering whether there is some reason this part of the code cannot be parallelized, or whether I am simply not doing it correctly. The call tree of the method I am trying to parallelize is: main -> pkmedian -> pFL -> pgain -> dist (-> denotes a call to the following method).

Best Answer

The code you have chosen to parallelize:

float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
    result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}

is not well suited to benefit from parallelization. You should not use parallel for here. You should generally avoid parallelizing an innermost loop; if you can parallelize an outer loop instead, you are far more likely to see gains, as sketched below.
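For example, the loop over points->num in pgain is the natural outer-loop candidate here. Note that, as posted, the #pragma omp parallel for on that loop lets several threads update cost_of_opening_x and lower[] at once. A minimal, untested sketch of a safer version of that loop (keeping dist serial) might look like this:

// Sketch only: the same loop as in pgain, with the per-thread contributions to
// cost_of_opening_x combined through a reduction and the shared lower[] slots
// guarded by an atomic update. dist() is the plain serial version here.
#pragma omp parallel for reduction(+:cost_of_opening_x)
for ( int i = 0; i < points->num; i++ ) {
    float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
    float current_cost = points->p[i].cost;

    if ( x_cost < current_cost ) {
        switch_membership[i] = 1;
        cost_of_opening_x += x_cost - current_cost;
    } else {
        int assign = points->p[i].assign;
        #pragma omp atomic
        lower[center_table[assign]] += current_cost - x_cost;
    }
}

Even then, any gain depends on points->num being large enough to cover the threading overhead discussed next.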

There is overhead in coordinating the team of threads that starts the parallel region, and more overhead in performing the reduction afterwards. Meanwhile, the contents of the parallel region take essentially no time to run. Given that overhead, you would need dim to be extremely large before you could expect this to give a performance benefit.

To express this more visually, consider that the math you are doing will take on the order of nanoseconds, and compare it with this graph showing the overhead of various OpenMP directives:

[Graph of OpenMP directive overheads]
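If you do end up keeping a directive on this inner loop, OpenMP's if() clause at least lets the runtime skip the parallel region whenever the trip count is too small to amortize that overhead. The cut-off below is an illustrative placeholder, not a measured value:

// Sketch: the thread team is only used when dim exceeds the (placeholder)
// threshold; for smaller dim the loop simply runs serially.
float result = 0.0f;
#pragma omp parallel for reduction(+:result) if(dim > 100000)
for ( int i = 0; i < dim; i++ ) {
    result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}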

If you need this to run faster, your first stop should be appropriate compilation flags, followed by a look at SIMD operations: SSE and AVX are good keywords here. Your compiler may even invoke them automatically.
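As a middle ground, OpenMP 4.0+ also offers a simd construct that asks the compiler to vectorize a loop within a single thread, so none of the thread-team overhead above applies. A sketch of dist using it, assuming the same Point type as in the question:

/* Sketch: vectorize the distance loop within one thread; no threads are
   created, so the only cost is whatever the generated SIMD code itself takes. */
float dist ( Point p1, Point p2, int dim )
{
    float result = 0.0f;
    #pragma omp simd reduction(+:result)
    for ( int i = 0; i < dim; i++ ) {
        float d = p1.coord[i] - p2.coord[i];
        result += d * d;
    }
    return result;
}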

I built some test code (see below), compiled it with various optimizations enabled (listed below), and ran it on arrays of 100,000 elements. Note that enabling -O3 brings the run time down to the same order of magnitude as the overhead of the OpenMP directives. That implies you would want arrays of around 400,000 elements before you should think about using OpenMP here, and probably closer to 1,000,000 to be safe.

  • No optimizations. Run time is about 1900 microseconds.
  • -O3: enables many optimizations. Run time is about 200 microseconds.
  • -ffast-math: you want this unless you are doing something very tricky. Run time is about the same.
  • -march=native: compiles the code to use the full capabilities of your CPU, rather than a generic instruction set that would run on many CPUs. Run time is about 100 microseconds.

So there we have it: strategic use of compiler options (-march=native) can double the speed of the code in question without having to deal with parallelism at all.

Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.

Test code:

#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>

int main(){
    std::vector<double> a;
    std::vector<double> b;
    for(int i=0;i<100000;i++){
        a.push_back(rand()/(double)RAND_MAX);
        b.push_back(rand()/(double)RAND_MAX);
    }

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

    float result = 0.0;
    //#pragma omp parallel for reduction(+:result)
    for (unsigned int i=0; i<a.size(); i++ )
        result += ( a[i] - b[i] ) * ( a[i] - b[i] );

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
}

This question and answer correspond to a similar question on Stack Overflow: https://stackoverflow.com/questions/44097890/
