clflush 通过 C 函数使缓存行无效-6ren

clflush 通过 C 函数使缓存行无效

转载作者：太空狗更新时间：2023-10-29 17:21:06

我正在尝试使用 clflush 手动逐出缓存行以确定缓存和行大小。我没有找到任何关于如何使用该指令的指南。我所看到的只是一些代码为此目的使用了更高级别的函数。

有一个内核函数void clflush_cache_range(void *vaddr, unsigned int size)，但我仍然不知道要在我的代码中包含什么以及如何使用它。我不知道该函数中的 size 是多少。

不仅如此，我如何确定该行已被逐出以验证我的代码的正确性？

更新:

这是我正在尝试做的事情的初始代码。

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  /* FLUSH A LINE */
  /* each element is 4 bytes */
  /* assuming that cache line size is 64 bytes */
  /* array[0] till array[15] is flushed */
  /* even if line size is less than 64 bytes */
  /* we are sure that array[0] has been flushed */
  _mm_clflush( &array[ 0 ] );



  int tm = 0;
  register uint64_t time1, time2, time3;


  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );

  time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time3 );
  return 0;
}

在运行代码之前，我想手动验证它是一个正确的代码。我在正确的道路上吗？我是否正确使用了 _mm_clflush？

更新:

感谢 Peter 的评论，我按如下方式修复了代码

  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );
  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time1 );

通过多次运行代码，我得到以下输出

$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252

第一次运行似乎是合理的。但是第二轮看起来很奇怪。通过从命令行运行代码，每次使用值初始化数组，然后我明确地逐出第一行。

更新4:

我试过 Hadi-Brais 代码，这里是输出

naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles

稍微不同的延迟是可以接受的。然而，与 21 和 14 相比，命中延迟为 63 也是可以观察到的。

更新5:

当我检查 Ubuntu 时，没有启用省电功能。可能 bios 中禁用了频率更改，或者配置错误

$ cat /proc/cpuinfo  | grep -E "(model|MHz)"
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz         : 2097.571
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz  
cpu MHz         : 2097.571
$ lscpu | grep MHz
CPU MHz:             2097.571

无论如何，这意味着频率设置为其最大值，这是我必须关心的。通过多次运行，我看到了一些不同的值。这些正常吗？

$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles

最佳答案

您的代码中存在多个错误，这些错误可能会导致出现您所看到的无意义的测量值。我已经修复了错误，您可以在下面的评论中找到解释。

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducable results */
/* This code still needs improvement to obtain more accurate measurements,
   and a lot of effort is required to do that—argh! */
/* Specifically, there is no single constant latency for the L1 because of
   the way it's designed, and more so for main memory. */
/* Things such as virtual addresses, physical addresses, TLB contents,
   code addresses, and interrupts may have an impact that needs to be
   investigated */
/* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
/* This code is written to run on Intel processors! */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];

  /* this is optional */
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  _mm_mfence();                      /* prevent clflush from being reordered by the CPU or the compiler in this direction */

  /* flush the line containing the element */
  _mm_clflush( &array[ 0 ] );

  //unsigned int aux;
  uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */

  /* You can generally use rdtsc or rdtscp.
     See: https://stackoverflow.com/questions/59759596/is-there-any-difference-in-between-rdtsc-lfence-rdtsc-and-rdtsc-rdtscp
     I AM NOT SURE THOUGH THAT THE SERIALIZATION PROERTIES OF
     RDTSCP ARE APPLICABLE AT THE COMPILER LEVEL WHEN USING THE
     __RDTSCP INTRINSIC. THIS IS TRUE FOR PURE FENCES SUCH AS LFENCE. */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  int temp = array[ 0 ];             /* array[0] is a cache miss */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load*/
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  msl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );             /* prevent the compiler from optimizing the load */
  printf( "miss section latency = %lu \n", msl );   /* the latency of everything in between the two rdtsc */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  temp = array[ 0 ];                 /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative accesses to the L1D or lower level inclusive caches don't evict it */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  hsl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );            /* prevent the compiler from optimizing the load */
  printf( "hit section latency = %lu \n", hsl );   /* the latency of everything in between the two rdtsc */


  _mm_mfence();                      /* this properly orders both clflush and rdtsc */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  osl = time2 - time1;

  printf( "overhead latency = %lu \n", osl ); /* the latency of everything in between the two rdtsc */


  printf( "Measured L1 hit latency = %lu TSC cycles\n", hsl - osl ); /* hsl is always larger than osl */
  printf( "Measured main memory latency = %lu TSC cycles\n", msl - osl ); /* msl is always larger than osl and hsl */

  return 0;
}

关于clflush 通过 C 函数使缓存行无效，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/51818655/

文章推荐： c - 出于演示目的，如何打印悬挂指针？

文章推荐： Python 3.0 和语言演变

文章推荐： python - 有没有办法从 python 中的文件中读取 10000行？

文章推荐： c - 基于两种可能结构的未知空指针的访问类型标志？

c++ - 编译错误。定义不匹配。无效(*)(无效*)
我有一个接受以下参数的函数: int setvalue(void (*)(void *)); 为了满足参数:void (*)(void *)，我创建了这样一个函数: static void *
c++ - 无效、无效、C 和 C++
我有以下代码: typedef void VOID; int f(void); int g(VOID); 在 C 中编译得很好(在 Fedora 10 上使用 gcc 4.3.2)。与 C++ 编译的
c - 无效(*foo)(无效): meaning of latest (void)
这个问题已经有答案了: Is f(void) deprecated in modern C and C++? [duplicate] (6 个回答) 已关闭 7 年前。 B.A.T.M.A.N./A.
asp.net-core - 无效 token - 观众 'empty' 无效
我在 ASP.NET Core 3.1 项目上有以下 Identity Server 4 配置: services .AddIdentityServer(y => { y.Events.R
azure - 委托(delegate) token 无效。指定的国家云 ID (1) 无效
我们有一个 O365 租户，一切都是开箱即用的。租户放置在德国云中，而不是全局 (office.de) 中。我们还开发了一个 Office 插件，使用 OAuth 2.0 授权访问共享点。首先，我们向
c# - 错误请求 - 无效 URL - HTTP 错误 400。请求 URL 无效
我有一个如下所示的路由 routes.MapRoute( name: "Default", url: "{controller}/{action}/{i
java - token 无效 - token 无效 : Invalid user for the two legged OAuth
我正在尝试使用 OAuth2.0 访问 google 文档。我已经从 Google API 控制台获取了客户端 ID 和 key 。但是当我运行这段代码时，我收到了异常。如果我遗漏了什么，有人可以建议
rust - 为什么创建const指针的集合对 `for val in a.iter()`无效，而对 `a.iter().map(|val| val)`无效？
此代码有效: let mut b: Vec = Vec::with_capacity(a.len()); for val in a.iter() { b.push(val); } 此代码不起作
azure - 输入参数 'scope' 无效。范围 https ://outlook. office365.com/EWS.AccessAsUser.All 无效
使用 client_credintials 授权类型请求 EWS oauth2 v2.0 的访问 token 时出现错误。 https://login.microsoftonline.com/tena
java - token 无效 - 无效 token : Cannot parse referred token string: Invalid gaia_data. Base64 token 上的 AuthSubToken 原型(prototype)
我通过 Java 应用程序使用 Google 电子表格时遇到了问题。我创建了应用程序，该应用程序运行了 1 年多，没有任何问题，我什至在 Create Spreadsheet using Google
无效 Base64 字符的正则表达式
如何创建匹配所有无效 Base64 字符的正则表达式？我在堆栈上找到了 [^a-zA-Z0-9+/=\n\r].*$ 但是当我尝试时我得到了带有 - 符号的结果字符串.我根本不知道正则表达式，任何人
YAML 无效 - 可能是引号问题
我从 Gitlab CI/CD Pipelines 获得错误信息:yaml invalid。问题是由 .gitlab-ci.yml 脚本的第五行引起的: - 'ssh deployer@gita
spring - @Qualifier 无效
我有 3 个数据源，设置如下: @Configuration @Component public class DataSourceConfig { @Bean("foo") @Conf
mysql - updateOnDuplicate 无效
你好，我想用bulkCreate ex 插入数据: [ { "typeId": 5, "devEui": "0094E796CBFCFEF9", "application_name": "Pressu
iPhone UIApplicationExitsOnSuspend 无效
UIApplicationExitsOnSuspend 不会强制我的应用程序退出。我已经清理过目标、删除了应用程序、重建并重新安装了很多次。我确实需要退出我的应用程序。最佳答案您是否链接了 SD
iPhone 团队配置文件 - 无效
在 iPhone 配置门户上，显示我的 iPhone 团队配置配置文件无效。有一个“由 Xcode 管理”文本。 “续订”按钮被禁用。我该如何解决这个问题？谢谢最佳答案使用 Xcode 3.2.
symfony2 CSRF 无效
好的，所以今天我用我们的“实时”数据库中的新信息更新了我的数据库……从那时起，我的一个表格就出现了问题。如果您需要任何代码，请告诉我，我将对其进行编辑并发布所需的代码... 我有一个报告表格，其中有一
有人可以解释这是什么意思吗？无效(*func)()；
我有一个结构体，其中有一个元素表示为 void (*func)(); 我知道 void 指针通常用于函数指针，但我似乎无法定义该函数。我不断收到取消引用指向不完整类型的指针。我用谷歌搜索了一下但没有结
Coldfusion，oauth_signature 无效
我正在尝试使用 Coldfusion 9 从 ning 网络获取凭证，所以首先这是测试 api 的 curl 语法: curl -k https://external.ningapis.com/xn/
c - 为什么此引用不起作用/无效？
这个问题已经有答案了: Does C have references? (2 个回答) 已关闭 4 年前。我正在学习 C 语言引用，这是我的代码: #include int main(void)

太空狗

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

clflush 通过 C 函数使缓存行无效