c - Why is the multithreaded version slower than the single-threaded one?

Reposted. Author: 行者123. Updated: 2023-11-30 19:46:23

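I wrote a parallel pthreads program that computes the column-sum norm of the product of two n*n matrices. The right-hand matrix is partitioned vertically. The user enters the matrix size n and the number of threads (p), so that:

  1. p threads take part in the parallel computation.
  2. A 1-D parallel algorithm for matrix multiplication is used:
  3. The right-hand matrix is partitioned along one dimension into p equal slices (for A*B, B is split into p slices).
  4. There is a one-to-one mapping between partitions and threads.
  5. Each thread is responsible for computing the corresponding slice of the result matrix.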

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#include <cblas.h>

double *A;
double *B;
double *C;
int n;
double matrix_norm;

/* One vertical slice of B and the matching slice of C. */
typedef struct {
    double *b;
    double *c;
    int num_of_columns;
    pthread_mutex_t *mutex;
} matrix_slice;

void *matrix_slice_multiply(void *arg) {
    matrix_slice *slice = arg;
    int i, j;

    /* C_slice = A * B_slice; the slices live inside the full n-wide matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, slice->num_of_columns, n,
                1.0, A, n, slice->b, n, 0.0, slice->c, n);

    // compute column norm of each slice
    double slice_norm = 0.0;
    for (j = 0; j < slice->num_of_columns; j++) {
        double column_sum = 0.;
        for (i = 0; i < n; i++)
            column_sum += *(slice->c + i * n + j);

        if (column_sum > slice_norm)
            slice_norm = column_sum;
    }

    /* fold this slice's result into the global maximum */
    pthread_mutex_lock(slice->mutex);
    if (slice_norm > matrix_norm)
        matrix_norm = slice_norm;
    pthread_mutex_unlock(slice->mutex);

    pthread_exit(NULL);
}

int main(void) {
    int num_of_thrds, num_of_columns_per_slice;
    pthread_t *working_thread;
    matrix_slice *slice;
    pthread_mutex_t *mutex;
    int i = 0;

    printf("Please enter matrix dimension n : ");
    scanf("%d", &n);

    printf("Please enter number of threads : ");
    scanf("%d", &num_of_thrds);

    while (num_of_thrds > n) {
        printf("number of threads must not be greater than matrix dimension\n");
        printf("Please enter number of threads : ");
        scanf("%d", &num_of_thrds);
    }

    // allocate memory for the matrices
    ///////////////////// Matrix A //////////////////////////
    A = (double *)malloc(n * n * sizeof(double));
    if (!A) {
        printf("memory failed \n");
        exit(1);
    }

    ///////////////////// Matrix B //////////////////////////
    B = (double *)malloc(n * n * sizeof(double));
    if (!B) {
        printf("memory failed \n");
        exit(1);
    }

    ///////////////////// Matrix C //////////////////////////
    C = (double *)malloc(n * n * sizeof(double));
    if (!C) {
        printf("memory failed \n");
        exit(1);
    }

    // initialize the matrices
    for (i = 0; i < n * n; i++) {
        A[i] = rand() % 15;
        B[i] = rand() % 10;
        C[i] = 0.;
    }

    clock_t t1 = clock();
    working_thread = malloc(num_of_thrds * sizeof(pthread_t));
    slice = malloc(num_of_thrds * sizeof(matrix_slice));
    mutex = malloc(sizeof(pthread_mutex_t));
    pthread_mutex_init(mutex, NULL); /* a mutex must be initialized before use */
    num_of_columns_per_slice = n / num_of_thrds;

    for (i = 0; i < num_of_thrds; i++) {
        slice[i].b = B + i * num_of_columns_per_slice;
        slice[i].c = C + i * num_of_columns_per_slice;
        slice[i].mutex = mutex;
        /* the last slice also takes the leftover columns */
        slice[i].num_of_columns = (i == num_of_thrds - 1)
            ? n - i * num_of_columns_per_slice
            : num_of_columns_per_slice;
        pthread_create(&working_thread[i], NULL, matrix_slice_multiply, (void *)&slice[i]);
    }
    for (i = 0; i < num_of_thrds; i++)
        pthread_join(working_thread[i], NULL);

    clock_t t2 = clock();
    printf("elapsed time: %f\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    printf("column sum norm is %f\n", matrix_norm);

    //deallocate memory
    pthread_mutex_destroy(mutex);
    free(mutex);
    free(A);
    free(B);
    free(C);
    free(working_thread);
    free(slice);

    return 0;
}

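I ran the program dozens of times with different inputs and found that the more threads I use, the longer it takes. This is quite counterintuitive. Shouldn't more threads help performance?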

Best answer

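The time saved by running the computation in parallel has to outweigh the overhead of creating, maintaining, and switching between threads. Instead of running dozens of times with a large number of threads, run one very large operation once, using as many threads as your system has cores.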

For "c - Why is the multithreaded version slower than the single-threaded one?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23982506/
