
c - Why does my program generate random results when nested?

Reposted. Author: 行者123. Updated: 2023-11-30 15:01:17

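I wrote this parallel matrix multiplication program in OpenMP using nested for loops. When I run the program, it shows the answers for different indices of the result matrix in a (mostly) random way. Here is the code snippet: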

#pragma omp parallel for
for(i=0;i<N;i++){
    #pragma omp parallel for
    for(j=0;j<N;j++){
        C[i][j]=0;
        #pragma omp parallel for
        for(m=0;m<N;m++){
            C[i][j]=A[i][m]*B[m][j]+C[i][j];
        }
        printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
    }
}

Best answer

As the commenters have already pointed out, these are the symptoms of a so-called "race condition".

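The threads OpenMP uses run independently of each other, but the results of the individual loop iterations of a matrix multiplication do not. One thread may therefore be at a different point than another, which causes trouble as soon as you depend on the order of the results. In the snippet, the innermost `parallel for` lets several threads read and update the same C[i][j] without any synchronization when nested parallelism is enabled, and the printf calls issued by different threads appear in no particular order.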

You can only parallelize the outermost loop:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int n;
    double **A, **B, **C, **D, t;
    int i, j, k;
    struct timeval start, stop;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    n = atoi(argv[1]);
    // accept exactly the range the usage message promises
    if (n < 2 || n >= 1000000) {
        fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    // make it repeatable
    srand(0xdeadbeef);

    // allocate memory for and initialize A
    A = malloc(sizeof(*A) * n);
    for (i = 0; i < n; i++) {
        A[i] = malloc(sizeof(**A) * n);
        for (j = 0; j < n; j++) {
            A[i][j] = (double) ((rand() % 100) / 99.);
        }
    }
    // do the same for B
    B = malloc(sizeof(*B) * n);
    for (i = 0; i < n; i++) {
        B[i] = malloc(sizeof(**B) * n);
        for (j = 0; j < n; j++) {
            B[i][j] = (double) ((rand() % 100) / 99.);
        }
    }

    // and C but initialize with zero
    C = malloc(sizeof(*C) * n);
    for (i = 0; i < n; i++) {
        C[i] = malloc(sizeof(**C) * n);
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    // ditto with D
    D = malloc(sizeof(*D) * n);
    for (i = 0; i < n; i++) {
        D[i] = malloc(sizeof(**D) * n);
        for (j = 0; j < n; j++) {
            D[i][j] = 0.0;
        }
    }

    // some coarse timing
    gettimeofday(&start, NULL);
    // naive matrix multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for naive run = %.10g\n", t);

    gettimeofday(&start, NULL);
    // only the outer loop is split among the threads; every thread gets
    // its own i, j and k and writes complete rows of D
#pragma omp parallel shared(A, B, D) private(i, j, k)
#pragma omp for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                D[i][j] = D[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    gettimeofday(&stop, NULL);
    t = ((stop.tv_sec - start.tv_sec) * 1000000u +
         stop.tv_usec - start.tv_usec) / 1.e6;
    printf("Timing for parallel run = %.10g\n", t);

    // check result
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (D[i][j] != C[i][j]) {
                printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
                       D[i][j] - C[i][j]);
            }
        }
    }

    // clean up
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        free(D[i]);
    }
    free(A);
    free(B);
    free(C);
    free(D);

    puts("All ok? Bye");

    exit(EXIT_SUCCESS);
}

(n > 2000 may require some patience before you get a result)
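Applied to the snippet from the question, the same idea could look roughly like the sketch below. The function name `matmul_outer_parallel` and its signature are assumptions made for illustration; they appear neither in the question nor in the answer. The loop variables and the accumulator are declared inside the loops, which makes them private to each thread without a `private` clause.

```c
/* Hypothetical helper, not from the original post: C = A * B for
   n x n matrices stored as arrays of row pointers, like A, B and D
   in the program above. */
void matmul_outer_parallel(int n, double **A, double **B, double **C)
{
    /* Only the outer loop is distributed over the threads. Each thread
       owns complete rows of C, so no two threads write the same cell. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* j, k and sum live inside the parallel loop body and are
           therefore private to the thread running this iteration. */
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Printing the cells in a separate serial loop after the call, rather than with printf inside the parallel loop, keeps the output in index order. Without OpenMP enabled at compile time the pragma is ignored and the function simply runs serially.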

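That only the outermost loop can be parallelized is not entirely true, though. You can (but should not) try to parallelize the innermost loop as well, with something like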
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
    sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
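One thing to keep in mind with this variant: the `reduction` clause gives every thread its own partial sum and combines the partial sums at the end, so the additions may happen in a different order than in the serial loop. Floating-point addition is not associative, so an exact `!=` comparison against the serial result can report tiny differences. The following is a minimal sketch of the inner product as a stand-alone function together with a tolerance-based comparison; `dot_column_reduction`, `nearly_equal` and the choice of tolerance are assumptions for illustration, not part of the original answer.

```c
/* Hypothetical helper: inner product of one row of A (passed as `row`)
   with column j of B, using an OpenMP reduction on the accumulator. */
double dot_column_reduction(int n, const double *row, double **B, int j)
{
    double sum = 0.0;
    /* Each thread accumulates into a private copy of sum; the copies
       are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++) {
        sum += row[k] * B[k][j];
    }
    return sum;
}

/* Hypothetical helper: returns 1 if x and y agree up to a relative
   tolerance (absolute tolerance for values below 1), else 0. */
int nearly_equal(double x, double y, double rel_tol)
{
    double diff = x > y ? x - y : y - x;
    double ax = x < 0 ? -x : x;
    double ay = y < 0 ? -y : y;
    double scale = ax > ay ? ax : ay;
    if (scale < 1.0) {
        scale = 1.0;
    }
    return diff <= rel_tol * scale;
}
```

In the result check of the program, a call such as `nearly_equal(D[i][j], C[i][j], 1e-12)` could then replace the exact comparison; the tolerance value is a guess and would need adjusting to the size and range of the matrices.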

The reduction variant does not seem to be faster, and it is even slower for small n. With the original code and n = 2500 (a single run only):

Timing for naive run    = 124.466307
Timing for parallel run = 44.154538

With the reduction, about the same:

Timing for naive run    = 119.586365
Timing for parallel run = 43.288371

A smaller n = 500, original code:

Timing for naive run    = 0.444061
Timing for parallel run = 0.150842

With the reduction it is already slower at that size:

Timing for naive run    = 0.447894
Timing for parallel run = 0.245481

The reduction might win for very large n, but I lack the necessary patience. Nevertheless, one last run with n = 4000 (OpenMP part only):

Normal:

Timing for parallel run = 174.647404

Reduction:

Timing for parallel run = 179.062463

That difference is still well within the error bars.

A better way to multiply large matrices (roughly n > 100) would be the Strassen algorithm.
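To give an idea of how Strassen's matrix multiplication scheme saves work, here is a minimal sketch of a single level of it on 2 x 2 operands: seven multiplications instead of the eight of the schoolbook method. In the full algorithm the entries are matrix blocks and the seven products are computed recursively; this sketch uses plain numbers only, and the name `strassen_2x2` is an assumption for illustration.

```c
/* One level of Strassen's scheme for 2 x 2 operands (hypothetical
   helper). With blocks instead of scalars the seven products m1..m7
   would be recursive calls. */
void strassen_2x2(double a[2][2], double b[2][2], double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) * b[0][0];
    double m3 = a[0][0] * (b[0][1] - b[1][1]);
    double m4 = a[1][1] * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) * b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    /* Recombine the seven products into the four result entries. */
    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}
```

The saving comes at the price of extra additions and a different rounding behavior, so whether and from which size on it pays off is something to measure rather than assume.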

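Oh: I used square matrices only for convenience, not because they have to be of that form! But if you have rectangular matrices with a large ratio between their side lengths, you might try to change the way the loops run; column-first or row-first can make a significant difference here.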
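What "changing the way the loops run" can mean is sketched below for rectangular matrices: the two inner loops are swapped (i, k, j instead of i, j, k), so that the innermost loop walks along one row of B and one row of C instead of jumping from row to row of B. With the row-pointer layout used in the program above, rows are contiguous in memory, which is why this order tends to be friendlier to the cache. The function name `matmul_ikj` and its parameters are assumptions for illustration, and whether the reordering helps on a given machine has to be measured.

```c
/* Hypothetical helper: C = A * B with A of size rows x inner,
   B of size inner x cols and C of size rows x cols, all stored
   as arrays of row pointers. */
void matmul_ikj(int rows, int inner, int cols,
                double **A, double **B, double **C)
{
    for (int i = 0; i < rows; i++) {
        /* Start from a zeroed result row. */
        for (int j = 0; j < cols; j++) {
            C[i][j] = 0.0;
        }
        for (int k = 0; k < inner; k++) {
            double a_ik = A[i][k];
            /* Innermost loop runs along row k of B and row i of C. */
            for (int j = 0; j < cols; j++) {
                C[i][j] += a_ik * B[k][j];
            }
        }
    }
}
```

Every iteration of the outer loop still writes only its own row of C, so the outer loop could be parallelized in the same way as in the program above.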

Regarding "c - Why does my program generate random results when nested?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41526396/
