gpt4 book ai didi

c++ - 您如何解释缓存未命中的 cachegrind 输出?

转载 作者:IT老高 更新时间:2023-10-28 22:20:20 24 4
gpt4 key购买 nike

出于好奇,我编写了几个不同版本的矩阵乘法并针对它运行了 cachegrind。在下面的结果中,我想知道哪些部分是 L1、L2、L3 未命中和引用,它们的真正含义是什么?下面是我的矩阵乘法代码,以防万一有人需要。

#define SLOWEST
==6933== Cachegrind, a cache and branch-prediction profiler
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==6933== Command: ./a.out 500
==6933==
--6933-- warning: L3 cache found, using its data for the LL simulation.
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 60.7487 seconds.
==6933==
==6933== I refs: 6,039,791,314
==6933== I1 misses: 1,611
==6933== LLi misses: 1,519
==6933== I1 miss rate: 0.00%
==6933== LLi miss rate: 0.00%
==6933==
==6933== D refs: 2,892,704,678 (2,763,005,485 rd + 129,699,193 wr)
==6933== D1 misses: 136,223,560 ( 136,174,705 rd + 48,855 wr)
==6933== LLd misses: 53,675 ( 5,247 rd + 48,428 wr)
==6933== D1 miss rate: 4.7% ( 4.9% + 0.0% )
==6933== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==6933==
==6933== LL refs: 136,225,171 ( 136,176,316 rd + 48,855 wr)
==6933== LL misses: 55,194 ( 6,766 rd + 48,428 wr)
==6933== LL miss rate: 0.0% ( 0.0% + 0.0% )

#define SLOWER
==8463== Cachegrind, a cache and branch-prediction profiler
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8463== Command: ./a.out 500
==8463==
--8463-- warning: L3 cache found, using its data for the LL simulation.
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 49.7397 seconds.
==8463==
==8463== I refs: 4,537,213,120
==8463== I1 misses: 1,571
==8463== LLi misses: 1,487
==8463== I1 miss rate: 0.00%
==8463== LLi miss rate: 0.00%
==8463==
==8463== D refs: 2,891,485,608 (2,761,862,312 rd + 129,623,296 wr)
==8463== D1 misses: 59,961,522 ( 59,913,256 rd + 48,266 wr)
==8463== LLd misses: 53,113 ( 5,246 rd + 47,867 wr)
==8463== D1 miss rate: 2.0% ( 2.1% + 0.0% )
==8463== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8463==
==8463== LL refs: 59,963,093 ( 59,914,827 rd + 48,266 wr)
==8463== LL misses: 54,600 ( 6,733 rd + 47,867 wr)
==8463== LL miss rate: 0.0% ( 0.0% + 0.0% )

#define SLOW
==9174== Cachegrind, a cache and branch-prediction profiler
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9174== Command: ./a.out 500
==9174==
--9174-- warning: L3 cache found, using its data for the LL simulation.
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 35.8901 seconds.
==9174==
==9174== I refs: 3,039,713,059
==9174== I1 misses: 1,570
==9174== LLi misses: 1,486
==9174== I1 miss rate: 0.00%
==9174== LLi miss rate: 0.00%
==9174==
==9174== D refs: 1,893,235,586 (1,763,112,301 rd + 130,123,285 wr)
==9174== D1 misses: 63,285,950 ( 62,987,684 rd + 298,266 wr)
==9174== LLd misses: 53,113 ( 5,246 rd + 47,867 wr)
==9174== D1 miss rate: 3.3% ( 3.5% + 0.2% )
==9174== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==9174==
==9174== LL refs: 63,287,520 ( 62,989,254 rd + 298,266 wr)
==9174== LL misses: 54,599 ( 6,732 rd + 47,867 wr)
==9174== LL miss rate: 0.0% ( 0.0% + 0.0% )

#define MEDIUM
==7838== Cachegrind, a cache and branch-prediction profiler
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==7838== Command: ./a.out 500
==7838==
--7838-- warning: L3 cache found, using its data for the LL simulation.
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 23.4097 seconds.
==7838==
==7838== I refs: 2,548,967,151
==7838== I1 misses: 1,610
==7838== LLi misses: 1,522
==7838== I1 miss rate: 0.00%
==7838== LLi miss rate: 0.00%
==7838==
==7838== D refs: 1,399,237,303 (1,267,363,440 rd + 131,873,863 wr)
==7838== D1 misses: 592,807 ( 293,091 rd + 299,716 wr)
==7838== LLd misses: 53,147 ( 5,248 rd + 47,899 wr)
==7838== D1 miss rate: 0.0% ( 0.0% + 0.2% )
==7838== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==7838==
==7838== LL refs: 594,417 ( 294,701 rd + 299,716 wr)
==7838== LL misses: 54,669 ( 6,770 rd + 47,899 wr)
==7838== LL miss rate: 0.0% ( 0.0% + 0.0% )

#define MEDIUMISH
==8438== Cachegrind, a cache and branch-prediction profiler
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8438== Command: ./a.out 500
==8438==
--8438-- warning: L3 cache found, using its data for the LL simulation.
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 24.0327 seconds.
==8438==
==8438== I refs: 2,550,211,553
==8438== I1 misses: 1,576
==8438== LLi misses: 1,488
==8438== I1 miss rate: 0.00%
==8438== LLi miss rate: 0.00%
==8438==
==8438== D refs: 1,400,107,343 (1,267,610,303 rd + 132,497,040 wr)
==8438== D1 misses: 339,977 ( 42,583 rd + 297,394 wr)
==8438== LLd misses: 53,114 ( 5,248 rd + 47,866 wr)
==8438== D1 miss rate: 0.0% ( 0.0% + 0.2% )
==8438== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8438==
==8438== LL refs: 341,553 ( 44,159 rd + 297,394 wr)
==8438== LL misses: 54,602 ( 6,736 rd + 47,866 wr)
==8438== LL miss rate: 0.0% ( 0.0% + 0.0% )

矩阵乘法代码。

#if defined(SLOWEST)
void multiply (float **A, float **B, float **out, int size) {
for (int row=0;row<size;row++)
for (int col=0;col<size;col++)
for (int in=0;in<size;in++)
out[row][col] += A[row][in] * B[in][col];
}
// Takes in 1-D arrays, same as before.
#elif defined(SLOWER)
void multiply (float *A, float *B, float *out, int size) {
for (int row=0;row<size;row++)
for (int col=0;col<size;col++)
for (int in=0;in<size;in++)
out[row * size + col] += A[row * size + in] * B[in * size + col];
}
// Flips first and second loops
#elif defined(SLOW)
void multiply (float *A, float *B, float *out, int size) {
for (int col=0;col<size;col++)
for (int row=0;row<size;row++) {
float curr = 0; // prevents from calculating position each time through
for (int in=0;in<size;in++)
curr += A[row * size + in] * B[in *size + col];
out[row * size + col] = curr;
}
}
#elif defined(MEDIUM)
// Keeps it organized for future codes.
float dotProduct(float *A, float *B, int size) {
float curr = 0;

for (int i=0;i<size;i++)
curr += A[i] * B[i];

return curr;
}
void multiply (float *A, float *B, float *out, int size) {
float *temp = new float[size];

for (int col=0;col<size;col++) {
for (int i=0;i<size;i++) // stores column into sequential array
temp[i] = B[i * size + col];
for (int row=0;row<size;row++)
out[row * size + col] = dotProduct(&A[row], temp, size); // uses function above for dot product.
}

delete[] temp;
}
#elif defined(MEDIUMISH)
float dotProduct(float *A, float *B, int size) {
float curr = 0;

for (int i=0;i<size;i++)
curr += A[i] * B[i];

return curr;
}
void multiply (float *A, float *B, float *out, int size) {
for (int i=0;i<size-1;i++)
for (int j=i+1;j<size;j++)
std::swap(B[i * size + j], B[j * size + i]);

for (int col=0;col<size;col++)
for (int row=0;row<size;row++)
out[row * size + col] = dotProduct(&A[row], &B[row], size); // uses function above for dot product.
}
#elif defined(FAST)

#elif defined(FASTER)

#endif

最佳答案

根据documentation cachegrind 只模拟一级和末级缓存:

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.

However, some modern machines have three or four levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and last-level caches. The reason for this choice is that the last-level cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2).

这意味着您无法获得 L2 信息,而在您的情况下只能获得 L1 和 L3。

cachegrind 输出的第一部分报告有关 L1 指令缓存的信息。在您的所有示例中,L1 指令缓存未命中的数量微不足道,未命中率始终为 0%。这意味着您的所有程序都适合您的 L1 指令缓存。

输出的第二部分报告有关 L1 和 LL(最后一级缓存,在您的情况下为 L3)数据缓存的信息。使用 D1 未命中率: 信息您应该看到哪个版本的矩阵乘法算法是“缓存效率最高的”

cachegrind 输出的最后一部分总结了有关指令和数据的 LL(最后一级缓存,在您的情况下为 L3)的信息。因此,它给出了内存访问次数和缓存服务的内存请求百分比。

关于c++ - 您如何解释缓存未命中的 cachegrind 输出?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/20172216/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com