C++缓存性能奇怪的行为-6ren

C++缓存性能奇怪的行为

转载作者：塔克拉玛干更新时间：2023-11-03 00:15:05

我读了一篇文章(1.5岁的http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736)，其中谈到了缓存性能和数据大小。他们显示了以下代码，他们说它们在i7(桑迪桥)上运行

static volatile int array[Size];
static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
          array[x]++;
}

他们声称，如果保持Size * Iterations不变，则增加Size，当数组内存的大小增加到超过L2高速缓存大小时，它们会观察到执行所需的巨大时间高峰(10x)。

作为我自己的练习，我想尝试一下以查看是否可以在我的机器上重现它们的结果。 (i7 3770k，win7，Visual c++ 2012编译器，Win32 Debug模式，未启用优化)。但是令我惊讶的是，我看不到执行时间的增加(甚至超过了L3缓存的大小)，这使我认为编译器正在某种程度上优化此代码。但是我也没有看到任何优化。我看到的唯一速度变化是，在我的机器字大小以下，它需要花费更长的时间。以下是我的时间安排，代码 list 和相关的反汇编。

有人知道原因吗:

1)为什么无论数组大小如何，花费的时间都根本不增加？或者我怎么能找到答案？

2)为什么要花很长时间才开始，然后减少直到达到高速缓存行大小，如果数据小于行大小，是否应该在不从高速缓存中读取的情况下处理更多的迭代？

时间:

Size=1,Iterations=1073741824, Time=3829
Size=2,Iterations=536870912, Time=2625
Size=4,Iterations=268435456, Time=2563
Size=16,Iterations=67108864, Time=2906
Size=32,Iterations=33554432, Time=3469
Size=64,Iterations=16777216, Time=3250
Size=256,Iterations=4194304, Time=3140
Size=1024,Iterations=1048576, Time=3110
Size=2048,Iterations=524288, Time=3187
Size=4096,Iterations=262144, Time=3078
Size=8192,Iterations=131072, Time=3125
Size=16384,Iterations=65536, Time=3109
Size=32768,Iterations=32768, Time=3078
Size=65536,Iterations=16384, Time=3078
Size=262144,Iterations=4096, Time=3172
Size=524288,Iterations=2048, Time=3109
Size=1048576,Iterations=1024, Time=3094
Size=2097152,Iterations=512, Time=3313
Size=4194304,Iterations=256, Time=3391
Size=8388608,Iterations=128, Time=3312
Size=33554432,Iterations=32, Time=3109
Size=134217728,Iterations=8, Time=3515
Size=536870912,Iterations=2, Time=3532

代码:

#include <string>
#include <cassert>
#include <windows.h>

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_body(volatile char* array)
{
     for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        for (unsigned int  x = 0; x < SIZE; x++)
        {
            array[x]++;
        }
    }
}

template <unsigned int SIZE, unsigned int ITERATIONS>
static void test_function()
{
    assert(SIZE*ITERATIONS == 1024*1024*1024);
    static volatile char array[SIZE];

    test_body<SIZE, 1>(array); //warmup

    DWORD beginTime = GetTickCount();
    test_body<SIZE, ITERATIONS>(array); 
    DWORD endTime= GetTickCount();
    printf("Size=%u,Iterations=%u, Time=%d\n", SIZE,ITERATIONS, endTime-beginTime);
}

int main()
{
    enum { eIterations= 1024*1024*1024};
    test_function<1, eIterations>();
    test_function<2, eIterations/2>();
    test_function<4, eIterations/4>();
    test_function<16, eIterations/16>();
    test_function<32, eIterations/ 32>();
    test_function<64, eIterations/ 64>();
    test_function<256, eIterations/ 256>();
    test_function<1024, eIterations/ 1024>();
    test_function<2048, eIterations/ 2048>();
    test_function<4096, eIterations/ 4096>();
    test_function<8192, eIterations/ 8192>();
    test_function<16384, eIterations/ 16384>();
    test_function<32768, eIterations/ 32768>();
    test_function<65536, eIterations/ 65536>();
    test_function<262144, eIterations/ 262144>();
    test_function<524288, eIterations/ 524288>();
    test_function<1048576, eIterations/ 1048576>();
    test_function<2097152, eIterations/ 2097152>();
    test_function<4194304, eIterations/ 4194304>();
    test_function<8388608, eIterations/ 8388608>();
    test_function<33554432, eIterations/ 33554432>();
    test_function<134217728, eIterations/ 134217728>();
    test_function<536870912, eIterations/ 536870912>();
}

拆卸

    for (unsigned int i = 0; i < ITERATIONS; i++)
00281A59  mov         dword ptr [ebp-4],0  
00281A60  jmp         test_body<536870912,2>+1Bh (0281A6Bh)  
00281A62  mov         eax,dword ptr [ebp-4]  
00281A65  add         eax,1  
00281A68  mov         dword ptr [ebp-4],eax  
00281A6B  cmp         dword ptr [ebp-4],2  
00281A6F  jae         test_body<536870912,2>+53h (0281AA3h)  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A71  mov         dword ptr [ebp-8],0  
00281A78  jmp         test_body<536870912,2>+33h (0281A83h)  
00281A7A  mov         eax,dword ptr [ebp-8]  
    {
        for (unsigned int  x = 0; x < SIZE; x++)
00281A7D  add         eax,1  
00281A80  mov         dword ptr [ebp-8],eax  
00281A83  cmp         dword ptr [ebp-8],20000000h  
00281A8A  jae         test_body<536870912,2>+51h (0281AA1h)  
        {
            array[x]++;
00281A8C  mov         eax,dword ptr [array]  
00281A8F  add         eax,dword ptr [ebp-8]  
00281A92  mov         cl,byte ptr [eax]  
00281A94  add         cl,1  
00281A97  mov         edx,dword ptr [array]  
00281A9A  add         edx,dword ptr [ebp-8]  
00281A9D  mov         byte ptr [edx],cl  
        }
00281A9F  jmp         test_body<536870912,2>+2Ah (0281A7Ah)  
    }
00281AA1  jmp         test_body<536870912,2>+12h (0281A62h)

最佳答案

TL; DR:您的测试为，对于缓存延迟或速度而言，测试不正确。相反，它测量了一些通过OoO CPU管道砍断复杂代码的问题。

使用正确的测试来测量高速缓存和内存延迟:lat_mem_rd from lmbench;进行速度(带宽)测量的正确测试:STREAM benchmark进行内存速度测试； tests from memtest86表示使用 rep movsl main operation的缓存速度)

此外，在现代(2010年及更新版本)台式机/服务器CPU中，在L1和L2高速缓存附近内置了硬件预取逻辑，它将检测线性访问模式并将数据从外部高速缓存预加载到内部，然后再请求此数据:Intel Optimization Manual - 7.2 Hardware prefetching of data ，第365页； intel.com blog, 2009。很难禁用所有硬件预取(SO Q/A 1，SO Q/A 2)

很长的故事:

我将尝试使用Linux中的perf性能监视工具(也称为perf_events)进行几次类似测试的测量。该代码基于Joky(32位int数组，而不是chars)中的程序，并分成几个二进制文件，如下所示:a5的大小为2 ^ 5 = 32； a10 => 2 ^ 10 = 1024(4 KB); a15 => 2 ^ 15 = 32768，a20(100万个整数= 4 MB)和a25(3200万个整数= 128MB)。 CPU是i7-2600四核Sandy Bridge 3.4 GHz。

让我们从具有默认事件集的基本perf stat开始(跳过一些行)。我选择了2 ^ 10(4 KB)和2 ^ 20(4 MB)

$ perf stat ./a10
Size=1024 ITERATIONS=1048576, TIME=2372.09 ms

 Performance counter stats for './a10':

               276 page-faults               #    0,000 M/sec
     8 238 473 169 cycles                    #    3,499 GHz
     4 936 244 310 stalled-cycles-frontend   #   59,92% frontend cycles idle
       415 849 629 stalled-cycles-backend    #    5,05% backend  cycles idle
    11 832 421 238 instructions              #    1,44  insns per cycle
                                             #    0,42  stalled cycles per insn
     1 078 974 782 branches                  #  458,274 M/sec
         1 080 091 branch-misses             #    0,10% of all branches

$ perf stat ./a20
Size=1048576 ITERATIONS=1024, TIME=2432.4 ms

 Performance counter stats for './a20':

             2 321 page-faults               #    0,001 M/sec
     8 487 656 735 cycles                    #    3,499 GHz
     5 184 295 720 stalled-cycles-frontend   #   61,08% frontend cycles idle
       663 245 253 stalled-cycles-backend    #    7,81% backend  cycles idle
    11 836 712 988 instructions              #    1,39  insns per cycle
                                             #    0,44  stalled cycles per insn
     1 077 257 745 branches                  #  444,104 M/sec
            30 601 branch-misses             #    0,00% of all branches

我们在这里可以看到什么？指令计数非常接近(因为Size * Iterations是常数)，周期计数和时间也很接近。这两个示例均具有10亿个分支，具有99％的良好预测。但是前端有60％的失速计数，后端有5-8％的失速计数。前端停顿是指令获取和解码中的停顿，很难说清原因，但是对于您的代码而言，前端每滴答声无法解码4条指令( Intel optimisation manual 的第B-41页，第B.3节-“性能调优“Sandy Bridge技术”，B.3.2小节自上而下的分层性能表征...)

$ perf record -e stalled-cycles-frontend ./a20
Size=1048576 ITERATIONS=1024, TIME=2477.65 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.097 MB perf.data (~4245 samples) ]
$ perf annotate -d a20|cat
 Percent |      Source code & Disassembly of a20
------------------------------------------------
         :      08048e6f <void test_body<1048576u, 1024u>(int volatile*)>:

   10.43 :       8048e87:       mov    -0x8(%ebp),%eax
    1.10 :       8048e8a:       lea    0x0(,%eax,4),%edx
    0.16 :       8048e91:       mov    0x8(%ebp),%eax
    0.78 :       8048e94:       add    %edx,%eax
    6.87 :       8048e96:       mov    (%eax),%edx
   52.53 :       8048e98:       add    $0x1,%edx
    9.89 :       8048e9b:       mov    %edx,(%eax)
   14.15 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    2.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    1.39 :       8048ea4:       cmp    $0xfffff,%eax

或者在这里使用原始操作码( objdump -d)，其中一些具有相当复杂的索引编制，因此可能无法由3个简单的解码器处理它们并等待唯一的复杂解码器(图像在那里: http://www.realworldtech.com/sandy-bridge/4/)

 8048e87:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048e8a:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
 8048e91:       8b 45 08                mov    0x8(%ebp),%eax
 8048e94:       01 d0                   add    %edx,%eax
 8048e96:       8b 10                   mov    (%eax),%edx
 8048e98:       83 c2 01                add    $0x1,%edx
 8048e9b:       89 10                   mov    %edx,(%eax)
 8048e9d:       83 45 f8 01             addl   $0x1,-0x8(%ebp)
 8048ea1:       8b 45 f8                mov    -0x8(%ebp),%eax
 8048ea4:       3d ff ff 0f 00          cmp    $0xfffff,%eax

后端停顿是通过等待内存或高速缓存(测量高速缓存时您感兴趣的事情)和内部执行核心停顿而创建的停顿:

$ perf record -e stalled-cycles-backend ./a20
Size=1048576 ITERATIONS=1024, TIME=2480.09 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4149 samples) ]
$ perf annotate -d a20|cat
    4.25 :       8048e96:       mov    (%eax),%edx
   58.68 :       8048e98:       add    $0x1,%edx
    8.86 :       8048e9b:       mov    %edx,(%eax)
    3.94 :       8048e9d:       addl   $0x1,-0x8(%ebp)
    7.66 :       8048ea1:       mov    -0x8(%ebp),%eax
    7.40 :       8048ea4:       cmp    $0xfffff,%eax

报告了大多数后端停顿的 add 0x1,%edx，因为它是数据的使用者，是从上一条命令中的数组加载的。通过存储到阵列，它们占后端停顿的70％，或者如果我们乘以程序中后端停顿的总份额(7％)，则为 的所有停顿的5％。或换句话说， cache is faster比您的程序。现在我们可以回答您的第一个问题:

Why the time taken does not increase at all regardless of the size of my array?

您的测试是如此糟糕(未优化)，以至于您试图测量缓存，但是它们的总运行时间仅降低了5％。您的测试是如此不稳定(嘈杂)，您将看不到5％的效果。

使用自定义的 perf stat运行，我们还可以测量缓存请求/未命中率。对于4 KB程序，L1数据高速缓存服务于所有负载的99,99％和所有存储的99,999％。我们可以注意到，不正确的测试所产生的缓存请求比在数组上遍历并增加每个元素(10亿个加载+ 10亿个存储)所需的请求多出两倍。其他访问用于使用局部变量(例如 x)，它们始终由缓存提供服务，因为它们的地址是恒定的)

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a10
Size=1024 ITERATIONS=1048576, TIME=2412.25 ms

 Performance counter stats for './a10':

     5 375 195 765 L1-dcache-loads
           364 140 L1-dcache-load-misses     #    0,01% of all L1-dcache hits
     2 151 408 053 L1-dcache-stores
            13 350 L1-dcache-store-misses

对于4 MB的程序，命中率要差很多倍。错过次数增加100倍!现在，所有内存请求的1.2％不是由L1服务，而是由L2服务。

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a20
Size=1048576 ITERATIONS=1024, TIME=2443.92 ms

 Performance counter stats for './a20':

     5 378 035 007 L1-dcache-loads
        67 725 008 L1-dcache-load-misses     #    1,26% of all L1-dcache hits
     2 152 183 588 L1-dcache-stores
        67 266 426 L1-dcache-store-misses

当我们要注意缓存延迟如何变为 from 4 cpu ticks up to 12(长3倍)，此更改仅影响1.2％的缓存请求以及我们的程序对缓存延迟敏感的仅7％的减速时，不是这种情况吗？？

如果我们将使用更大的数据集怎么办？好的，这是a25(4字节整数的2 ^ 25 = 128 MB，是缓存大小的几倍):

$ perf stat   ./a25
Size=134217728 ITERATIONS=8, TIME=2437.25 ms

 Performance counter stats for './a25':

           262 417 page-faults               #    0,090 M/sec
    10 214 588 827 cycles                    #    3,499 GHz
     6 272 114 853 stalled-cycles-frontend   #   61,40% frontend cycles idle
     1 098 632 880 stalled-cycles-backend    #   10,76% backend  cycles idle
    13 683 671 982 instructions              #    1,34  insns per cycle
                                             #    0,46  stalled cycles per insn
     1 274 410 549 branches                  #  436,519 M/sec
           315 656 branch-misses             #    0,02% of all branches

$ perf stat -e 'L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2444.13 ms

 Performance counter stats for './a25':

     6 138 410 226 L1-dcache-loads
        77 025 747 L1-dcache-load-misses     #    1,25% of all L1-dcache hits
     2 515 141 824 L1-dcache-stores
        76 320 695 L1-dcache-store-misses

L1丢失率几乎相同，并且更多后端停滞。我能够获取“cache-references，cache-misses”事件的统计信息，我建议它们与L3缓存有关(对L2的请求要多几倍):

$ perf stat -e 'cache-references,cache-misses' ./a25
Size=134217728 ITERATIONS=8, TIME=2440.71 ms

 Performance counter stats for './a25':

        17 053 482 cache-references
        11 829 118 cache-misses              #   69,365 % of all cache refs

因此，未命中率很高，但是该测试可处理10亿个(有用)负载，其中只有0.8亿个未命中L1。内存处理了1亿个请求。内存延迟约为 230 cpu clocks，而不是4时钟L1延迟。测试可以看到吗？可能是，如果噪音很低。

关于C++缓存性能奇怪的行为，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/23282744/

文章推荐： linux - 如果内核是交叉编译的，我必须交叉编译模块吗？

文章推荐： c - 了解代码 Linux 设备驱动程序 m25p80.c

java - JSR 107 - 缓存 (JCache) 与 CPU 缓存
我阅读了有关 JSR 107 缓存 (JCache) 的内容。我很困惑:据我所知，每个 CPU 都管理其缓存内存(无需操作系统的任何帮助)。那么，为什么我们需要 Java 缓存处理程序？ (如果C
jquery - 使用 jQuery 缓存，缓存 jQuery Sortable 对象
好吧，我是 jQuery 的新手。我一直在这里和那里搞乱一点点并习惯它。我终于明白了(它并不像某些人想象的那么难)。因此，鉴于此链接:http://jqueryui.com/sortable/#dis
hibernate 缓存？
我正在使用 Struts 2 和 Hibernate。我有一个简单的表，其中包含一个日期字段，用于存储有关何时发生特定操作的信息。这个日期值显示在我的 jsp 中。我遇到的问题是hibernate更
缓存-修复浏览器本地缓存页面
我有点不确定这里发生了什么，但是我试图解释正在发生的事情，也许一旦我弄清楚我到底在问什么，就可能写一个更好的问题。我刚刚安装了Varnish，对于我的请求时间来说似乎很棒。这是一个Magneto 2
haskell 缓存
解决 Project Euler 的问题后，我在论坛中发现了以下 Haskell 代码: fillRow115 minLength = cache where cache = ((map fill
Python包代理/缓存
我正试图找到一种方法来为我网络上的每台计算机缓存或存储某些 python 包。我看过以下解决方案: pypicache但它不再被积极开发，作者推荐 devpi，请参见此处:https://bitbuc
缓存 WebSocket
我想到的一个问题是可以从一开始就缓存网络套接字吗？在我的拓扑中，我在通过双 ISP 连接连接到互联网的 HAProxy 服务器后面有 2 个 Apache 服务器(带有 Google PageSpee
Linux内存管理(缓存)
我很难说出不同缓存区域 (OS) 之间的区别。我想简要解释一下磁盘\缓冲区\交换\页面缓存。他们住在哪里？它们之间的主要区别是什么？据我了解，页面缓存是主内存的一部分，用于存储从 I/O 设备获取的
LeetCode_数据结构设计_困难_460. LFU 缓存
1.题目请你为最不经常使用（LFU）缓存算法设计并实现数据结构。实现 LFUCache 类： LFUCache(int capacity) - 用数据结构的容量 capacity 初始化对象 in
LeetCode_数据结构设计_中等_146. LRU 缓存
1.题目请你设计并实现一个满足 LRU (最近最少使用) 缓存约束的数据结构。实现 LRUCache 类： ① LRUCache(int capacity) 以正整数作为容量 capacity
Django 缓存 - 删除某些页面的缓存
我想在访问该 View 时关闭某些页面的缓存。它适用于简单查询模型对象的页面。好像什么时候 'django.middleware.cache.FetchFromCacheMiddleware', 启
WiX ExePackage 缓存
documents为 ExePackage element state Cache属性的目的是 Whether to cache the package. The default is "yes".
Docker 缓存，它是如何工作的？
我知道 docker 用图层存储每个图像。如果我在一台开发服务器上有多个用户，并且每个人都在运行相同的 Dockerfile，但将镜像存储为 user1_myapp . user2 将其存储为 use
Codeigniter - 缓存 - 服务器？
在 Codeigniter 中没有出现缓存问题几年后，我发现了一个问题。我在其他地方看到过该问题，但没有适合我的解决方案。例如，如果我在 View 中更改一些纯 html 文本并上传新文件并按 F5
caching - Janusgraph 缓存
我在 Janusgraph 文档中阅读了有关 Janusgraph Cache 的内容。关于事务缓存，我几乎没有怀疑。我在我的应用程序中使用嵌入式 janusgrah 服务器。如果我只对例如进行读取
javascript - 有没有办法从终端重新启动无效/缓存？
我想知道是否有来自终端的任何命令可以用来匹配 Android Studio 中执行文件>使缓存无效/重新启动的使用。谢谢! 最佳答案 According to a JetBrains employe
python - 带有默认可选参数的内存/缓存
我想制作一个 python 装饰器来内存函数。例如，如果 @memoization_decorator def add(a, b, negative=False): print "Com
jquery - 缓存 $(this) 是否会带来性能提升？
我经常在 jQuery 事件处理程序中使用 $(this) 并且从不缓存它。如果我愿意的话 var $this = $(this); 并且将使用变量而不是构造函数，我的代码会获得任何显着的额外性能吗？
使用模式匹配禁止 Varnish 缓存
是的，我要说实话，我不知道varnish vcl，我可以解决一些基本问题，但是我不太清楚，这就是为什么我遇到问题了。我正在尝试通过http请求设置缓存禁止，但是该请求不能通过DNS而是通过 Varn
Varnish 缓存-无法处理4000个并发用户
在 WP 站点上加载约 4000 个并发用户时遇到此问题。这是我的配置: F5 负载均衡器 ---> Varnish 4，8 核，32 Gb RAM ---> 9 个后端，4 个核，每个 16 RA

塔克拉玛干

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

C++缓存性能奇怪的行为