gpt4 book ai didi

使用 C 程序理解 CPU 缓存

转载 作者:行者123 更新时间:2023-12-03 03:39:11 26 4
gpt4 key购买 nike

我正在尝试使用 C 程序来理解 CPU 缓存和缓存行,就像我对大多数 C 概念所做的那样。我使用的程序如下所示。我从博客中得到了这个想法。

http://igoro.com/archive/gallery-of-processor-cache-effects/

现在以下程序在我的机器上的输出如下所示。这是 CFLAGS="-g -O0 -Wall"的输出。

./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

经过优化的输出 (CFLAGS=-g -O3 -Wall)

CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

博客中指出

The first loop multiplies every value in the array by 3, and the second loop multiplies > only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.

我的机器上似乎不是这种情况。可以看到,执行的时间

loop 1 is 0.46 seconds.

还有

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

对于不同的 j 值。

为什么会发生这种情况?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int j = 0;
/* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
It lives in a fixed-size buffer, set by the program executable's configuration according to the
operating system, but its actual size is seldom checked against the available space. */
/* int arr[MAX_SIZE]; */

int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

/* CPU clock ticks count start */
start = clock();

/* Loop 1 */
for (i = 0; i < MAX_SIZE; i++)
arr[i] *= 3;

/* CPU clock ticks count stop */
end = clock();

cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

for (j = 8 ; j < 32 ; j++)
{
/* CPU clock ticks count start */
start = clock();

/* Loop 2 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i] *= 3;

/* CPU clock ticks count stop*/
end = clock();

cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
}

return 0;
}

最佳答案

我稍微修改了代码。首先是修改摘要:

  1. 将 MAX_SIZE 显着增大,以确保在情况发生变化时存在真正的差异。 (它现在使用完整的 2 GB 内存,因此请勿在 32 位操作系统上执行此操作)
  2. 运行循环 1 几次(在我的机器上,这会产生一些差异,因为我的机器第一次运行速度会变慢 - 这可能是因为 malloc 实际上并未将内存映射到进程中'地址空间,因此在第一个循环中,我们在内存中映射时获得了一些额外的开销)。它还确保 CPU 在执行其他循环时以“全速”运行,而不是“省电”速度。
  3. 改变 j在第二个循环中通过乘以 2 更快地获取值(在本例中 <<= 1*= 2 相同 - 使用 shift 的老习惯)
  4. 使用+= 3而不是*= 3 。 (乘法比 += 慢一点,但在这种情况下差别不大。
  5. 添加loop3它执行的操作数量与loop2完全相同,但内存范围较小[使用 &用 2n-1 值来限制范围]。

我用 gcc -Wall -O3 -sdc=c99 编译了代码,使用版本 4.6.3,并在四核 Athlon 965、Fedora Core 16 x86-64 和 16 GB RAM 上运行。

代码如下:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int j = 0;
/* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
It lives in a fixed-size buffer, set by the program executable's configuration according to the
operating system, but its actual size is seldom checked against the available space. */
/* int arr[MAX_SIZE]; */

int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

/* CPU clock ticks count start */

for(int k = 0; k < 3; k++)
{
start = clock();

/* Loop 1 */
for (i = 0; i < MAX_SIZE; i++)
arr[i] += 3;

/* CPU clock ticks count stop */
end = clock();

cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
}

for (j = 1 ; j <= 1024 ; j <<= 1)
{
/* CPU clock ticks count start */
start = clock();

/* Loop 2 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i] += 3;

/* CPU clock ticks count stop */
end = clock();

cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
}


// Third loop, performing the same operations as loop 2,
// but only touching 16KB of memory
for (j = 1 ; j <= 1024 ; j <<= 1)
{
/* CPU clock ticks count start */
start = clock();

/* Loop 3 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i & 0xfff] += 3;

/* CPU clock ticks count stop */
end = clock();

cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
}
return 0;
}

结果:

CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.

如您所见,loop2的前几个花费相同的时间 - 一旦达到 32,时间就开始减少,因为处理器不需要每个缓存行,但在 loop3 中在这种情况下,每个循环中的操作数量会相当直接地影响总时间。

编辑:

乘法 ( *=3 ) 与加法 ( +=3 ) 并没有真正产生太大的区别,除了在 Loop3 情况下,它增加了大约 30% 的循环时间。

关于使用 C 程序理解 CPU 缓存,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17138599/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com