gpt4 book ai didi

c++ - OpenMP 积分图像比顺序图像慢

转载 作者:塔克拉玛干 更新时间:2023-11-03 07:05:54 24 4
gpt4 key购买 nike

我已经使用 OpenMP 在 C++ 中实现了总面积表(或积分图像)。

问题是顺序代码总是比并行代码快,即使改变线程数和图像大小也是如此。

例如,我尝试了从 (100x100) 到 (10000x10000) 的图像和从 1 到 64 的线程,但没有一个组合是更快的。

我也在不同的机器上试过这段代码,比如:

  • Mac OSX 1.4 GHz Intel Core i5 双核
  • Mac OSX 2.3 GHz Intel Core i7 四核
  • Ubuntu 16.04 Intel Xeon E5-2620 2.4 GHz 12 核

已使用 OpenMP 函数测量时间:omp_get_wtime()

对于编译,我使用:g++ -fopenmp -Wall main.cpp

并行代码如下:

void transpose(unsigned long *src, unsigned long *dst, const int N, const int M) {
#pragma omp parallel for
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
dst[n] = src[M*j + i];
}
}


unsigned long * integralImageMP(uint8_t*x, int n, int m){

unsigned long * out = new unsigned long[n*m];
unsigned long * rows = new unsigned long[n*m];

#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
rows[i*m] = x[i*m];
for (int j = 1; j < m; ++j)
{
rows[i*m + j] = x[i*m + j] + rows[i*m + j - 1];
}
}

transpose(rows, out, n, m);

#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
rows[i*m] = out[i*m];
for (int j = 1; j < m; ++j)
{
rows[i*m + j] = out[i*m + j] + rows[i*m + j - 1];
}
}

transpose(rows, out, m, n);

delete [] rows;
return out;
}

这是顺序代码:

unsigned long * integralImage(uint8_t*x, int n, int m){
unsigned long * out = new unsigned long[n*m];

for (int i = 0; i < n; ++i)
{
for (int j = 0; j < m; ++j)
{
unsigned long val = x[i*m + j];
if (i>=1)
{
val += out[(i-1)*m + j];
if (j>=1)
{
val += out[i*m + j - 1] - out[(i-1)*m + j - 1];
}
} else {
if (j>=1)
{
val += out[i*m + j -1];
}
}
out[i*m + j] = val;
}
}

return out;
}

我也尝试过不使用 transpose 但它更慢可能是因为缓存访问。

调用代码示例:

int main(int argc, char **argv){
uint8_t* image = //read image from file (gray scale)
int height = //height of the image
int width = //width of the image

double start_omp = omp_get_wtime();

unsigned long* integral_image_parallel = integralImageMP(image, height, width); //parallel

double end_omp = omp_get_wtime();

double time_tot = end_omp - start_omp;

std::cout << time_tot << std::endl;

start_omp = omp_get_wtime();

unsigned long* integral_image_serial = integralImage(image, height, width); //sequential

end_omp = omp_get_wtime();

time_tot = end_omp - start_omp;

std::cout << time_tot << std::endl;

return 0;
}

每个线程都在处理一个行 block (也许每个线程正在做的事情的说明会很有用): enter image description here其中 ColumnSum 完成转置矩阵并重复 RowSum。

最佳答案

首先让我说,结果让我有点惊讶,我猜测问题出在转置算法所需的非本地内存访问中。

无论如何,您都可以通过两遍方法将顺序算法转换为并行算法来缓解它。第一遍必须在相隔 N 行的 T 个线程中计算二维积分,第二遍必须补偿每个 block 不是从前一行的累积结果而是从零开始的事实。

使用 Matlab 的示例以 2D 形式显示了原理。

 f=fix(rand(12,8)*8)   % A random matrix with 12 rows, 8 columns

5 6 1 4 7 5 4 4
4 6 0 7 1 3 2 0
7 0 2 3 0 1 6 3
5 3 1 7 4 3 7 2
6 4 3 2 7 3 5 1
3 3 2 5 5 0 2 1
3 5 7 5 1 4 4 3
6 5 7 4 2 1 0 0
0 2 0 5 3 3 7 4
1 3 5 5 7 4 7 3
1 0 2 1 1 2 6 5
3 7 3 1 6 2 2 5


ff=cumsum(cumsum(f')') % The Summed Area Table
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113
27 46 53 76 95 110 134 144
30 52 61 89 113 128 154 165
33 60 76 109 134 153 183 197
39 71 94 131 158 178 208 222
39 73 96 138 168 191 228 246
40 77 105 152 189 216 260 281
41 78 108 156 194 223 273 299
44 88 121 170 214 245 297 328

fx=[cumsum(cumsum(f(1:4,:)')'); % The original table summed in
cumsum(cumsum(f(5:8,:)')'); % three parts -- 4 rows per each
cumsum(cumsum(f(9:12,:)')')] % "thread"

5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113 %% Notice this row #4
6 10 13 15 22 25 30 31
9 16 21 28 40 43 50 52
12 24 36 48 61 68 79 84
18 35 54 70 85 93 104 109 %% Notice this row #8
0 2 2 7 10 13 20 24
1 6 11 21 31 38 52 59
2 7 14 25 36 45 65 77
5 17 27 39 56 67 89 106

fx(4,:) + fx(8,:) %% this is the SUM of row #4 and row #8
39 71 94 131 158 178 208 222

%% and finally -- what is the difference of the piecewise
%% calculated result and the real result?
ff-fx

0 0 0 0 0 0 0 0 %% look !! the first block
0 0 0 0 0 0 0 0 %% is already correct
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
21 36 40 61 73 85 104 113 %% All these rows in this
21 36 40 61 73 85 104 113 %% block are short by
21 36 40 61 73 85 104 113 %% the row #4 above
21 36 40 61 73 85 104 113 %%
39 71 94 131 158 178 208 222 %% and all these rows
39 71 94 131 158 178 208 222 %% in this block are short
39 71 94 131 158 178 208 222 %% by the SUM of the rows
39 71 94 131 158 178 208 222 %% #4 and #8 above

幸运的是,可以开始整合 block 2,即在 block #1 得到补偿之前的第 2N..3N-1 行 - 只需计算偏移量,这是一个相对较小的顺序 任务。

acc_for_block_2 = row[2*N-1] + row[N-1];
acc_for_block_3 = acc_for_block_2 + row[3*N-1];
..
acc_for_block_T-1 = acc_for_block_(T-2) + row[N*(T-1)-1];

关于c++ - OpenMP 积分图像比顺序图像慢,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50352166/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com