
c++ - Increasing the number of CPUs decreases performance, with constant CPU load and no communication

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:29:24

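I ran into an interesting phenomenon that I cannot explain. I have not found an answer online, since most posts deal with weak scaling and therefore with communication overhead.

Below is a small piece of code that illustrates the problem. It was tested in different languages with similar results, hence the multiple tags.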

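#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    /* Number of processes and the rank of this process. */
    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;

    /* Synchronize all ranks so they enter the timed region together. */
    MPI_Barrier(MPI_COMM_WORLD);

    t = clock();

    /* Timed region: an empty nested loop with no communication. */
    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            // nothing
        }
    }

    /* clock() reports processor time used by this process. */
    t = clock() - t;

    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

    MPI_Finalize();

    return 0;
}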

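As you can see, the only part that is timed is the loop. So on identical CPUs, without hyperthreading and with enough RAM, increasing the number of CPUs should yield exactly the same time.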
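When reproducing this, two details of the benchmark are worth keeping in mind: an optimizing compiler is free to remove an empty loop, and `clock()` reports processor time rather than wall-clock time. The following is only a sketch, assuming C++11 and no MPI. The names `TimedRun` and `busy_loop` are hypothetical and not part of the original post. It keeps the loop alive with a `volatile` counter and records both kinds of time.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper type, not from the original post.
struct TimedRun {
    double cpu_seconds;    // processor time from std::clock()
    double wall_seconds;   // elapsed time from std::chrono::steady_clock
    long long iterations;  // inner-loop iterations actually executed
};

TimedRun busy_loop(long long imax, long long jmax) {
    // volatile forces the compiler to perform every single update.
    volatile long long counter = 0;

    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    for (long long i = 0; i < imax; ++i) {
        for (long long j = 0; j < jmax; ++j) {
            counter = counter + 1;
        }
    }

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    TimedRun r;
    r.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    r.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    r.iterations = counter;
    return r;
}
```

Calling `busy_loop(10000000, 1000)` in each process corresponds to the loop bounds used above. Because of the `volatile` counter, absolute times from this sketch are not comparable with the timings quoted in the question.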

However, on my machine with 32 cores and 15 GiB of RAM,

mpirun -np 1 ./test 

gives

 proc 0 took 22.262777 seconds.

but

mpirun -np 20 ./test

gives

 proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.

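For other numbers of CPUs, the values fall in between.

htop also shows RAM consumption increasing (VIRT is about 100M for 1 core and about 300M for 20 cores), although that may be related to the size of the MPI communicator?

Finally, the effect is definitely related to the size of the problem, so it is not communication overhead, which would cause a constant delay regardless of the loop size. Indeed, lowering imax to, say, 10 000 makes the wall times similar.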
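The arithmetic behind this argument can be sketched as follows. The helper names `absolute_delay` and `relative_slowdown` are hypothetical. The inputs used in the note below are the timings quoted in this post: the single-process run and the fastest rank of the 20-process run.

```cpp
// Hypothetical helper: extra seconds taken by the multi-process run.
double absolute_delay(double t_single, double t_multi) {
    return t_multi - t_single;
}

// Hypothetical helper: the same delay as a fraction of the single-process time.
double relative_slowdown(double t_single, double t_multi) {
    return (t_multi - t_single) / t_single;
}
```

With `t_single = 22.262777` and `t_multi = 24.440767`, the delay is about 2.18 seconds, or roughly 9.8 percent. A fixed overhead of that size could not fit into a run that takes only about 0.028 seconds in total, which is why the delay appears to scale with the amount of work and not to be constant.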

1 core:

 proc 0 took 0.028439 seconds.

20 cores:

 proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.

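I tried this on several machines with similar results. Maybe we are missing something very simple.

Thanks for your help!

Best answer

The processor's turbo frequencies are limited by temperature.

Modern processors are constrained by their thermal design power (TDP). While the processor is cool, a single core may clock up to its turbo multiplier. When the chip is hot, or when several cores are busy, the cores slow down to the guaranteed base frequency. The gap between base and turbo frequency is often around 400 MHz. AVX or FMA3 workloads may push the clock even below the base frequency.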
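One way to check this on a given machine is to watch the core clock while the benchmark runs with 1 process and then with 20. The following is a minimal sketch, assuming Linux with the cpufreq sysfs interface available; on other systems the file does not exist and the function returns -1. The helper names `cpufreq_path` and `read_cpu_khz` are hypothetical.

```cpp
#include <fstream>
#include <string>

// Path of the Linux cpufreq file that reports the current frequency
// of one logical CPU (assumption: the cpufreq sysfs interface is present).
std::string cpufreq_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_cur_freq";
}

// Returns the current frequency in kHz, or -1 if the file cannot be read.
long read_cpu_khz(int cpu) {
    std::ifstream in(cpufreq_path(cpu));
    long khz = -1;
    if (!(in >> khz)) {
        return -1;
    }
    return khz;
}
```

Sampling `read_cpu_khz` for a busy core during each run would show whether the clock drops as more cores become active, which is what the explanation above predicts.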

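Regarding "c++ - Increasing the number of CPUs decreases performance, with constant CPU load and no communication", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45983371/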
