
c++ - Explain why a second allocation changes performance


I was testing some microbenchmarks of dense matrix multiplication (out of curiosity), and I noticed some very strange performance results.

Here is a minimal working example:

#include <benchmark/benchmark.h>

#include <random>

constexpr long long n = 128;

struct mat_bench_fixture : public benchmark::Fixture
{
    double *matA, *matB, *matC;

    mat_bench_fixture()
    {
        matA = new double[n * n];
        matB = new double[n * n];
        matC = new double[n * n];
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
// toggled to #if 1 to force freeing and reallocating all of the buffers
#if 0
        delete[] matA;
        delete[] matB;
        delete[] matC;
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
        matA = new double[n * n];
        matB = new double[n * n];
        matC = new double[n * n];
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
#endif
    }

    ~mat_bench_fixture()
    {
        delete[] matA;
        delete[] matB;
        delete[] matC;
    }

    void SetUp(const benchmark::State& s) override
    {
        // generate random data
        std::mt19937 gen;
        std::uniform_real_distribution<double> dis(0, 1);
        for (double* i = matA; i != matA + n * n; ++i)
        {
            *i = dis(gen);
        }
        for (double* i = matB; i != matB + n * n; ++i)
        {
            *i = dis(gen);
        }
    }
};

BENCHMARK_DEFINE_F(mat_bench_fixture, impl1)(benchmark::State& st)
{
    for (auto _ : st)
    {
        for (long long row = 0; row < n; ++row)
        {
            for (long long col = 0; col < n; ++col)
            {
                matC[row * n + col] = 0;
                for (long long k = 0; k < n; ++k)
                {
                    matC[row * n + col] += matA[row * n + k] * matB[k * n + col];
                }
            }
        }
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_REGISTER_F(mat_bench_fixture, impl1);

BENCHMARK_MAIN();

There is an #if 0 block in the fixture's constructor that can be toggled to #if 1 for the two scenarios I'm testing. What I've noticed is that when I force all of the buffers to be reallocated, the time the benchmark takes on my system magically improves by about 15%, and I have no explanation for why this happens. I'm hoping someone can enlighten me. I'd also like to know whether there are any other microbenchmarking "best practices" for avoiding these kinds of odd performance anomalies in the future.

This is how I'm compiling it (assuming Google Benchmark is already installed somewhere it can be found):

$CC -o mult_test mult_test.cpp -std=c++14 -pthread -O3 -fno-omit-frame-pointer -lbenchmark

And this is how I've been running it:

./mult_test --benchmark_repetitions=5

I'm doing all of my testing on Ubuntu 18.04 x64 (kernel 4.15.0-30-generic).

I've tried several different variations of this code, and they all give the same basic result across multiple runs (it surprised me how consistent the results are):

  1. Moved the allocation/initialization into the benchmark "SetUp" phase (the untimed part) so that allocation/deallocation happens at every new sample point (a sketch of this variant is shown right after this list)
  2. Switched compilers between GCC 7.3.0 and Clang 6.0.0
  3. Tried different machines with different CPUs (an Intel i5-6600K, and a machine with dual-socket Xeon E5-2630 v2)
  4. Tried different ways of implementing the benchmark framework (i.e., not using Google Benchmark at all and implementing the timing manually via std::chrono)
  5. Forced all buffers to be aligned to a few different boundaries (64 bytes, 128 bytes, 256 bytes)
  6. Forced a fixed number of iterations within each sampled timing period
  7. Tried running with a higher number of repetitions (20+)
  8. Forced a constant CPU clock frequency using the performance governor
  9. Tried different compiler flags for the optimization options (dropped -fno-omit-frame-pointer, tried -march=native)
  10. Tried using std::vector to manage the storage, using new[]/delete[] pairs, and using malloc/free. They all give similar results.
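
For reference, the first item above refers to a variant roughly like the following sketch (not the exact code that was benchmarked, and the fixture name is just illustrative): allocation and initialization move into SetUp and deallocation into TearDown, so every repetition sees freshly allocated buffers.

struct mat_bench_fixture_setup : public benchmark::Fixture
{
    double *matA = nullptr, *matB = nullptr, *matC = nullptr;

    void SetUp(const benchmark::State&) override
    {
        // fresh, untimed allocation and initialization for every sample point
        matA = new double[n * n];
        matB = new double[n * n];
        matC = new double[n * n];
        std::mt19937 gen;
        std::uniform_real_distribution<double> dis(0, 1);
        for (long long i = 0; i < n * n; ++i) matA[i] = dis(gen);
        for (long long i = 0; i < n * n; ++i) matB[i] = dis(gen);
    }

    void TearDown(const benchmark::State&) override
    {
        delete[] matA;
        delete[] matB;
        delete[] matC;
    }
};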

I've compared the assembly of the hot section of the code and it is identical between the two test cases (this is transcribed from a perf screenshot of one of them):

40:
  mov    0xc0(%r15),%rcx
  mov    0xd0(%r15),%rdx
  add    $0x8,%rcx
  mov    0xc8(%r15),%r9
  add    %r8,%r9
  xor    %r10d,%r10d
  nop
60:
  mov    %r10,%r11
  shl    $0x7,%r11
  mov    %r9,%r13
  xor    %esi,%esi
  nop
70:
  lea    (%rsi,%r11,1),%rax
  movq   $0x0,(%rdx,%rax,8)
  xorpd  %xmm0,%xmm0
  mov    $0xffffffffffffff80,%rdi
  mov    %r13,%rbx
  nop
90:
  movsd  0x3f8(%rcx,%rdi,8),%xmm1
  mulsd  -0x400(%rbx),%xmm1
  addsd  %xmm0,%xmm1
  movsd  %xmm1,(%rdx,%rax,8)
  movsd  0x400(%rcx,%rdi,8),%xmm0
  mulsd  (%rbx),%xmm0
  addsd  %xmm1,%xmm0
  movsd  %xmm0,(%rdx,%rax,8)
  add    $0x800,%rbx
  add    $0x2,%rdi
  jne    90
  add    $0x1,%rsi
  add    $0x8,%r13
  cmp    $0x80,%rsi
  jne    70
  add    $0x1,%r10
  add    $0x400,%rcx
  cmp    $0x80,%r10
  jne    60
  add    $0xffffffffffffffff,%r12
  jne    40
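
As a side note on reading that disassembly: the innermost k loop has been unrolled by a factor of two, and the running sum is stored back into matC after every multiply-add (presumably because the compiler cannot prove that matC does not alias matA or matB). In C++ terms it corresponds roughly to this sketch of the inner loop (my reading of the assembly, not code from the original source):

// approximate shape of the hot inner loop in the assembly above
for (long long k = 0; k < n; k += 2)
{
    matC[row * n + col] += matA[row * n + k] * matB[k * n + col];
    matC[row * n + col] += matA[row * n + k + 1] * matB[(k + 1) * n + col];
}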

Here is a representative run (with perf stat counters) for the case where the reallocation is not performed:

Running ./mult_test
Run on (4 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
mat_bench_fixture/impl1 2181531 ns 2180896 ns 322
mat_bench_fixture/impl1 2188280 ns 2186860 ns 322
mat_bench_fixture/impl1 2182988 ns 2182150 ns 322
mat_bench_fixture/impl1 2182715 ns 2182025 ns 322
mat_bench_fixture/impl1 2175719 ns 2175653 ns 322
mat_bench_fixture/impl1_mean 2182246 ns 2181517 ns 322
mat_bench_fixture/impl1_median 2182715 ns 2182025 ns 322
mat_bench_fixture/impl1_stddev 4480 ns 4000 ns 322

Performance counter stats for './mult_test --benchmark_repetitions=5':

3771.370173 task-clock (msec) # 0.994 CPUs utilized
223 context-switches # 0.059 K/sec
0 cpu-migrations # 0.000 K/sec
242 page-faults # 0.064 K/sec
15,808,590,474 cycles # 4.192 GHz (61.31%)
20,201,201,797 instructions # 1.28 insn per cycle (69.04%)
1,844,097,332 branches # 488.973 M/sec (69.04%)
358,319 branch-misses # 0.02% of all branches (69.14%)
7,232,957,363 L1-dcache-loads # 1917.859 M/sec (69.24%)
3,774,591,187 L1-dcache-load-misses # 52.19% of all L1-dcache hits (69.35%)
558,507,528 LLC-loads # 148.091 M/sec (69.46%)
93,136 LLC-load-misses # 0.02% of all LL-cache hits (69.47%)
<not supported> L1-icache-loads
736,008 L1-icache-load-misses (69.47%)
7,242,324,412 dTLB-loads # 1920.343 M/sec (69.34%)
581 dTLB-load-misses # 0.00% of all dTLB cache hits (61.50%)
1,582 iTLB-loads # 0.419 K/sec (61.39%)
307 iTLB-load-misses # 19.41% of all iTLB cache hits (61.29%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

3.795924436 seconds time elapsed

And here is a representative run (with perf stat counters) for the case with the forced reallocation:

Running ./mult_test
Run on (4 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
mat_bench_fixture/impl1 1862961 ns 1862919 ns 376
mat_bench_fixture/impl1 1861986 ns 1861947 ns 376
mat_bench_fixture/impl1 1860330 ns 1860305 ns 376
mat_bench_fixture/impl1 1859711 ns 1859652 ns 376
mat_bench_fixture/impl1 1863299 ns 1863273 ns 376
mat_bench_fixture/impl1_mean 1861658 ns 1861619 ns 376
mat_bench_fixture/impl1_median 1861986 ns 1861947 ns 376
mat_bench_fixture/impl1_stddev 1585 ns 1591 ns 376

Performance counter stats for './mult_test --benchmark_repetitions=5':

3724.287293 task-clock (msec) # 0.995 CPUs utilized
11 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
246 page-faults # 0.066 K/sec
15,612,924,579 cycles # 4.192 GHz (61.34%)
23,344,859,019 instructions # 1.50 insn per cycle (69.07%)
2,130,528,330 branches # 572.063 M/sec (69.07%)
331,651 branch-misses # 0.02% of all branches (69.08%)
8,369,233,786 L1-dcache-loads # 2247.204 M/sec (69.18%)
4,206,241,296 L1-dcache-load-misses # 50.26% of all L1-dcache hits (69.29%)
308,687,646 LLC-loads # 82.885 M/sec (69.40%)
94,288 LLC-load-misses # 0.03% of all LL-cache hits (69.50%)
<not supported> L1-icache-loads
475,066 L1-icache-load-misses (69.50%)
8,360,570,315 dTLB-loads # 2244.878 M/sec (69.37%)
364 dTLB-load-misses # 0.00% of all dTLB cache hits (61.53%)
213 iTLB-loads # 0.057 K/sec (61.42%)
144 iTLB-load-misses # 67.61% of all iTLB cache hits (61.32%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

3.743017809 seconds time elapsed

Here is a minimal working example that has no external dependencies and makes it possible to test memory alignment issues:

#include <random>
#include <chrono>
#include <iostream>
#include <cstdlib>

constexpr long long n = 128;
constexpr size_t alignment = 64;

inline void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}
inline void clobber()
{
    asm volatile("" : : : "memory");
}

struct mat_bench_fixture
{
    double *matA, *matB, *matC;

    mat_bench_fixture()
    {
        matA = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matB = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matC = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        escape(matA);
        escape(matB);
        escape(matC);
#if 0
        free(matA);
        free(matB);
        free(matC);
        escape(matA);
        escape(matB);
        escape(matC);
        matA = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matB = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matC = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        escape(matA);
        escape(matB);
        escape(matC);
#endif
    }

    ~mat_bench_fixture()
    {
        free(matA);
        free(matB);
        free(matC);
    }

    void SetUp()
    {
        // generate random data
        std::mt19937 gen;
        std::uniform_real_distribution<double> dis(0, 1);
        for (double* i = matA; i != matA + n * n; ++i)
        {
            *i = dis(gen);
        }
        for (double* i = matB; i != matB + n * n; ++i)
        {
            *i = dis(gen);
        }
    }

    void run()
    {
        constexpr int iters = 400;
        std::chrono::high_resolution_clock timer;
        auto start = timer.now();
        for (int i = 0; i < iters; ++i)
        {
            for (long long row = 0; row < n; ++row)
            {
                for (long long col = 0; col < n; ++col)
                {
                    matC[row * n + col] = 0;
                    for (long long k = 0; k < n; ++k)
                    {
                        matC[row * n + col] += matA[row * n + k] * matB[k * n + col];
                    }
                }
            }
            escape(matA);
            escape(matB);
            escape(matC);
            clobber();
        }
        auto stop = timer.now();
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / iters
                  << std::endl;
    }
};

int main()
{
    mat_bench_fixture bench;
    for (int i = 0; i < 5; ++i)
    {
        bench.SetUp();
        bench.run();
    }
}

Compiled with:

g++ -o mult_test mult_test.cpp -std=c++14 -O3

Best Answer

On my machine I can reproduce your situation by using different alignments for the pointers. Try this code:

mat_bench_fixture()
{
    matA = new double[n * n + 256];
    matB = new double[n * n + 256];
    matC = new double[n * n + 256];

    // align pointers to 1024
    // (note: the pointers returned by new[] are discarded here, so the fixture's
    // delete[] calls are no longer valid; keep copies of the originals if you
    // need to free the memory)
    matA = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matA) + 1023) & ~1023);
    matB = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matB) + 1023) & ~1023);
    matC = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matC) + 1023) & ~1023);

    // toggle this to toggle the alignment offset of matB
    // matB += 2;
}

If I toggle the commented-out line in this code, I get a 34% difference on my machine.

Different alignment offsets lead to different timings. You can try offsetting the other two pointers as well; sometimes the difference is smaller, sometimes larger, sometimes there is no change.

This must be caused by a caching effect: depending on the low bits of the pointers, different conflict patterns arise in the caches. Since your routine is memory-bound (all of the data does not fit in L1), cache behavior matters a great deal.
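
To make that concrete: with 64-byte cache lines, the cache set an address maps to is determined by the address bits just above the 6-bit line offset, so buffers whose low address bits differ compete for cache sets in different patterns. Here is a minimal sketch of how you could inspect that, assuming a 32 KB, 8-way, 64-byte-line L1 data cache (64 sets) as on the CPUs you mention; the helper name is just for illustration:

#include <cstdint>
#include <cstdio>

// Print which L1D set an address maps to, assuming 64-byte lines and 64 sets
// (32 KB, 8-way). The set index is address bits [6, 12).
void print_l1_set(const char* name, const void* p)
{
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    std::printf("%s: %p  line offset = %llu  L1 set = %llu\n",
                name, p,
                static_cast<unsigned long long>(addr & 63),
                static_cast<unsigned long long>((addr >> 6) & 63));
}

Calling this on matA, matB and matC in both of your scenarios should show whether the second allocation changes the low-order address bits, and therefore the pattern in which rows of the three matrices collide in the same cache sets.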

The original question, "c++ - Explain why a second allocation changes performance", was asked on Stack Overflow: https://stackoverflow.com/questions/51829128/
