gpt4 book ai didi

c - L1 缓存未命中的成本是多少?

转载 作者:太空狗 更新时间:2023-10-29 16:16:17 24 4
gpt4 key购买 nike

编辑:出于引用目的(如果有人偶然发现这个问题),Igor Ostrovsky 写了一个 great post关于缓存未命中。它讨论了几个不同的问题并显示了示例编号。 结束编辑

我做了一些测试 <long story goes here>并且想知道性能差异是否是由于内存缓存未命中造成的。以下代码演示了该问题并将其归结为关键时序部分。下面的代码有几个循环,它们以随机顺序访问内存,然后以升序地址顺序访问内存。

我在 XP 机器(使用 VS2005 编译:cl/O2)和 Linux 机器 (gcc –Os) 上运行它。两者都产生了相似的时间。这些时间以毫秒为单位。我相信所有循环都在运行并且没有优化(否则它会“立即”运行)。

*** Testing 20000 nodesTotal Ordered Time: 888.822899Total Random Time: 2155.846268

Do these numbers make sense? Is the difference primarily due to L1 cache misses or is something else going on as well? There are 20,000^2 memory accesses and if every one were a cache miss, that is about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * 3.2*10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).

typedef struct stItem
{
long lData;
//char acPad[20];
} LIST_NODE;



#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
LONGLONG t2, llFreq;

QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
*pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
// Just use clock(), this test doesn't need higher resolution
*pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
LONGLONG t2 = clock();
*pdMS = (double)( t2 - t1 ) / ( CLOCKS_PER_SEC / 1000 );
}
#endif



long longrand()
{
#if defined( WIN32 )
// Stupid cheesy way to make sure it is not just a 16-bit rand value
return ( rand() << 16 ) | rand();
#else
return rand();
#endif
}

// get random value in the given range
int randint( int m, int n )
{
int ret = longrand() % ( n - m + 1 );
return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
long *plShuffle, // (O) return array of "randomly" ordered integers
long lNumItems // (I) length of array
)
{
long i;
long j;
long t;

for ( i = 0; i < lNumItems; i++ )
plShuffle[i] = i;

for ( i = 0; i < lNumItems; i++ )
{
j = randint( i, lNumItems - 1 );

t = plShuffle[i];
plShuffle[i] = plShuffle[j];
plShuffle[j] = t;
}
}



int main( int argc, char* argv[] )
{
long *plDataValues;
LIST_NODE *pstNodes;
long lNumItems = 20000;
long i, j;
LONGLONG t1; // for timing
double dms;

if ( argc > 1 && atoi(argv[1]) > 0 )
lNumItems = atoi( argv[1] );

printf( "\n\n*** Testing %u nodes\n", lNumItems );

srand( (unsigned int)time( 0 ));

// allocate the nodes as one single chunk of memory
pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
assert( pstNodes != NULL );

// Create an array that gives the access order for the nodes
plDataValues = (long*)malloc( lNumItems * sizeof( long ));
assert( plDataValues != NULL );

// Access the data in order
for ( i = 0; i < lNumItems; i++ )
plDataValues[i] = i;

StartTimer( &t1 );

// Loop through and access the memory a bunch of times
for ( j = 0; j < lNumItems; j++ )
{
for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}

StopTimer( t1, &dms );
printf( "Total Ordered Time: %f\n", dms );

// now access the array positions in a "random" order
ShuffleArray( plDataValues, lNumItems );

StartTimer( &t1 );

for ( j = 0; j < lNumItems; j++ )
{
for ( i = 0; i < lNumItems; i++ )
{
pstNodes[plDataValues[i]].lData = i * j;
}
}

StopTimer( t1, &dms );
printf( "Total Random Time: %f\n", dms );

}

最佳答案

这里试图通过类比烘烤巧克力曲奇来深入了解缓存未命中的相对成本......

你的手就是你的寄存器。将巧克力片放入面团中需要 1 秒钟。

厨房柜台是您的 L1 缓存,比收银机慢 12 倍。走到柜台、拿起一袋核桃并将一些核桃倒进您的厨房需要 12 x 1 = 12 秒手。

冰箱是你的 L2 缓存,比 L1 慢四倍。走到冰箱需要 4 x 12 = 48 秒,打开它,把昨晚的剩菜移开,拿走拿出一盒鸡蛋,打开纸盒,将 3 个鸡蛋放在柜台上,然后将纸盒放回冰箱。

柜子是你的L3缓存,比L2慢三倍。走三步到柜子,弯腰,开门,root需要3 x 48 = 2分24秒四处寻找烘焙供应 jar ,从橱柜中取出它,打开它,挖出发酵粉,把它放在柜台上,扫掉你洒在地板上的烂摊子。

主存呢?那是街角的商店,比 L3 慢 5 倍。 需要 5 x 2:24 = 12 分钟才能找到你的钱包,穿上你的鞋子和夹克,冲到街上,拿一升牛奶,冲回家,脱下鞋子和夹克,然后回到厨房。

请注意,所有这些访问都是恒定的复杂性 -- O(1) -- 但它们之间的差异会对性能产生巨大影响。纯粹针对大 O 复杂性进行优化就像决定是一次向面糊中添加 1 个巧克力片还是一次 10 个巧克力片,但忘记将它们放在您的购物 list 上。

故事的寓意:组织您的内存访问,使 CPU 必须尽可能少地去买东西。

数字取自 CPU Cache Flushing Fallacy博文,表明对于特定的 2012 年英特尔处理器,以下内容为真:

  • 寄存器访问 = 每个周期 4 条指令
  • L1 延迟 = 3 个周期(12 个寄存器)
  • L2 延迟 = 12 个周期(4 x L1、48 x 寄存器)
  • L3 延迟 = 38 个周期(3 x L2、12 x L1、144 x 寄存器)
  • DRAM 延迟 = 65 ns = 3 GHz CPU 上的 195 个周期(5 x L3、15 x L2、60 x L1、720 x 寄存器)

Gallery of Processor Cache Effects也很好地阅读了这个主题。

Mmmm, cookies ...

关于c - L1 缓存未命中的成本是多少?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1126529/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com