c++ - How do you interpret cachegrind output for cache misses?


Reposted by: IT老高. Updated: 2023-10-28 22:20:20

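Out of curiosity, I wrote several versions of matrix multiplication and ran cachegrind against each of them. In the results below, which parts are the L1, L2, and L3 misses and references, and what do they actually mean? My matrix multiplication code is included below in case anyone needs it.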

#define SLOWEST
==6933== Cachegrind, a cache and branch-prediction profiler
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==6933== Command: ./a.out 500
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation.
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 60.7487 seconds.
==6933== 
==6933== I   refs:      6,039,791,314
==6933== I1  misses:            1,611
==6933== LLi misses:            1,519
==6933== I1  miss rate:          0.00%
==6933== LLi miss rate:          0.00%
==6933== 
==6933== D   refs:      2,892,704,678  (2,763,005,485 rd   + 129,699,193 wr)
==6933== D1  misses:      136,223,560  (  136,174,705 rd   +      48,855 wr)
==6933== LLd misses:           53,675  (        5,247 rd   +      48,428 wr)
==6933== D1  miss rate:           4.7% (          4.9%     +         0.0%  )
==6933== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==6933== 
==6933== LL refs:         136,225,171  (  136,176,316 rd   +      48,855 wr)
==6933== LL misses:            55,194  (        6,766 rd   +      48,428 wr)
==6933== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOWER
==8463== Cachegrind, a cache and branch-prediction profiler
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8463== Command: ./a.out 500
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation.
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 49.7397 seconds.
==8463== 
==8463== I   refs:      4,537,213,120
==8463== I1  misses:            1,571
==8463== LLi misses:            1,487
==8463== I1  miss rate:          0.00%
==8463== LLi miss rate:          0.00%
==8463== 
==8463== D   refs:      2,891,485,608  (2,761,862,312 rd   + 129,623,296 wr)
==8463== D1  misses:       59,961,522  (   59,913,256 rd   +      48,266 wr)
==8463== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==8463== D1  miss rate:           2.0% (          2.1%     +         0.0%  )
==8463== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8463== 
==8463== LL refs:          59,963,093  (   59,914,827 rd   +      48,266 wr)
==8463== LL misses:            54,600  (        6,733 rd   +      47,867 wr)
==8463== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOW
==9174== Cachegrind, a cache and branch-prediction profiler
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9174== Command: ./a.out 500
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation.
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 35.8901 seconds.
==9174== 
==9174== I   refs:      3,039,713,059
==9174== I1  misses:            1,570
==9174== LLi misses:            1,486
==9174== I1  miss rate:          0.00%
==9174== LLi miss rate:          0.00%
==9174== 
==9174== D   refs:      1,893,235,586  (1,763,112,301 rd   + 130,123,285 wr)
==9174== D1  misses:       63,285,950  (   62,987,684 rd   +     298,266 wr)
==9174== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==9174== D1  miss rate:           3.3% (          3.5%     +         0.2%  )
==9174== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==9174== 
==9174== LL refs:          63,287,520  (   62,989,254 rd   +     298,266 wr)
==9174== LL misses:            54,599  (        6,732 rd   +      47,867 wr)
==9174== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUM
==7838== Cachegrind, a cache and branch-prediction profiler
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==7838== Command: ./a.out 500
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation.
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 23.4097 seconds.
==7838== 
==7838== I   refs:      2,548,967,151
==7838== I1  misses:            1,610
==7838== LLi misses:            1,522
==7838== I1  miss rate:          0.00%
==7838== LLi miss rate:          0.00%
==7838== 
==7838== D   refs:      1,399,237,303  (1,267,363,440 rd   + 131,873,863 wr)
==7838== D1  misses:          592,807  (      293,091 rd   +     299,716 wr)
==7838== LLd misses:           53,147  (        5,248 rd   +      47,899 wr)
==7838== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==7838== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==7838== 
==7838== LL refs:             594,417  (      294,701 rd   +     299,716 wr)
==7838== LL misses:            54,669  (        6,770 rd   +      47,899 wr)
==7838== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUMISH
==8438== Cachegrind, a cache and branch-prediction profiler
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8438== Command: ./a.out 500
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation.
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 24.0327 seconds.
==8438== 
==8438== I   refs:      2,550,211,553
==8438== I1  misses:            1,576
==8438== LLi misses:            1,488
==8438== I1  miss rate:          0.00%
==8438== LLi miss rate:          0.00%
==8438== 
==8438== D   refs:      1,400,107,343  (1,267,610,303 rd   + 132,497,040 wr)
==8438== D1  misses:          339,977  (       42,583 rd   +     297,394 wr)
==8438== LLd misses:           53,114  (        5,248 rd   +      47,866 wr)
==8438== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==8438== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8438== 
==8438== LL refs:             341,553  (       44,159 rd   +     297,394 wr)
==8438== LL misses:            54,602  (        6,736 rd   +      47,866 wr)
==8438== LL miss rate:            0.0% (          0.0%     +         0.0%  )

Matrix multiplication code.

#if defined(SLOWEST)
    void multiply (float **A, float **B, float **out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row][col] += A[row][in] * B[in][col];
    }
// Takes in 1-D arrays, same as before.
#elif defined(SLOWER)
    void multiply (float *A, float *B, float *out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row * size + col] += A[row * size + in] * B[in * size + col];
    }
// Flips first and second loops
#elif defined(SLOW)
    void multiply (float *A, float *B, float *out, int size) {
        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++) {
                float curr = 0;  // prevents from calculating position each time through
                for (int in=0;in<size;in++)
                    curr += A[row * size + in] * B[in *size + col];
                out[row * size + col] = curr;
            }
    }
#elif defined(MEDIUM)
    // Keeps it organized for future codes.
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        float *temp = new float[size];

        for (int col=0;col<size;col++) {
            for (int i=0;i<size;i++)  // stores column into sequential array
                temp[i] = B[i * size + col];
            for (int row=0;row<size;row++)
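                // NOTE: &A[row] starts at element `row` of the flat array, not at the start
                // of row `row`; a standard matrix product would likely use &A[row * size].
                out[row * size + col] = dotProduct(&A[row], temp, size);  // uses function above for dot product.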
        }

        delete[] temp;
    }
#elif defined(MEDIUMISH)
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        for (int i=0;i<size-1;i++)
            for (int j=i+1;j<size;j++)
                std::swap(B[i * size + j], B[j * size + i]);

        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++)
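                // NOTE: after the transpose above, a standard matrix product would likely use
                // &A[row * size] and &B[col * size]; &A[row] and &B[row] both start mid-row.
                out[row * size + col] = dotProduct(&A[row], &B[row], size);  // uses function above for dot product.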
    }
#elif defined(FAST)

#elif defined(FASTER)

#endif

Best answer

According to the documentation, cachegrind simulates only the first-level and last-level caches:

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

However, some modern machines have three or four levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and last-level caches. The reason for this choice is that the last-level cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2).

This means you cannot get L2 information; in your case you only get figures for L1 and L3.

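The first part of the cachegrind output reports on the L1 instruction cache. In all of your examples the number of L1 instruction cache misses is negligible and the miss rate is always 0.00%, which means each of your programs fits in the L1 instruction cache.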

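The second part of the output reports on the L1 and LL (last-level cache, L3 in your case) data caches. The "D1 miss rate:" line shows which version of your matrix multiplication algorithm is the most cache-efficient.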

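The last part of the cachegrind output summarizes the LL (last-level cache, L3 in your case) figures for instructions and data combined. "LL refs" is the number of accesses that reached the LL cache, that is, the I1 misses plus the D1 misses. "LL misses" is the number of those accesses that the LL cache could not serve and that went on to main memory. "LL miss rate" expresses those misses as a percentage of all memory accesses, not of the LL refs.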

Regarding "c++ - How do you interpret cachegrind output for cache misses?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20172216/
