java - L3 CPU cache Java benchmark shows strange results

Reposted by: 搜寻专家 · Updated: 2023-11-01 02:58:48

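After reading this article, I decided to check it on my laptop. The idea is to create arrays of size [1..40] MB and then touch each one about 1024 times (for example, with a step of 1024 for a 1 MB array, 2048 for a 2 MB array, and so on). My code is: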

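import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

public class L3CacheBenchmark {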

    @State(Scope.Benchmark)
    public static class P {

        @Param({
                       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                       "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                       "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
                       "31", "32", "33", "34", "35", "36", "37", "38", "39", "40",
               })
        public int size;
    }

    @State(Scope.Thread)
    public static class ThreadData {

        byte[] array;
        int    len;

        @Setup
        public void setup(P p) {
            array = new byte[p.size * 1024 * 1024];
            len = array.length;
        }
    }


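    @Benchmark
    public byte[] testMethod(ThreadData data) {
        // Stride chosen so the loop touches about 1024 elements regardless of array size.
        int step = (data.len / 1024) - 1;
        for (int k = 0; k < data.len; k += step) {
            data.array[k] = 1;
        }
        // Return the array so the JIT cannot treat the stores as dead code.
        return data.array;
    }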

}

And the results:

Benchmark                    (size)   Mode  Cnt       Score       Error  Units
L3CacheBenchmark.testMethod       1  thrpt  100  310521,031 ±  1124,590  ops/s
L3CacheBenchmark.testMethod       2  thrpt  100  331853,495 ±  1124,547  ops/s
L3CacheBenchmark.testMethod       3  thrpt  100  311499,659 ±   745,414  ops/s
L3CacheBenchmark.testMethod       4  thrpt  100  290270,382 ±  8501,690  ops/s
L3CacheBenchmark.testMethod       5  thrpt  100  212929,246 ± 14847,931  ops/s
L3CacheBenchmark.testMethod       6  thrpt  100  315968,138 ±  4454,210  ops/s
L3CacheBenchmark.testMethod       7  thrpt  100  209679,904 ± 26050,365  ops/s
L3CacheBenchmark.testMethod       8  thrpt  100   60409,187 ±   212,548  ops/s
L3CacheBenchmark.testMethod       9  thrpt  100  221290,756 ± 28970,586  ops/s
L3CacheBenchmark.testMethod      10  thrpt  100  322865,687 ±  1545,967  ops/s
L3CacheBenchmark.testMethod      11  thrpt  100  263153,747 ± 18497,624  ops/s
L3CacheBenchmark.testMethod      12  thrpt  100  298683,205 ±  1277,032  ops/s
L3CacheBenchmark.testMethod      13  thrpt  100  180984,220 ± 26611,649  ops/s
L3CacheBenchmark.testMethod      14  thrpt  100  324815,938 ±  1657,303  ops/s
L3CacheBenchmark.testMethod      15  thrpt  100  264965,412 ±  9335,923  ops/s
L3CacheBenchmark.testMethod      16  thrpt  100   58830,825 ±   291,412  ops/s
L3CacheBenchmark.testMethod      17  thrpt  100  255576,829 ±  7083,025  ops/s
L3CacheBenchmark.testMethod      18  thrpt  100  324174,133 ±  2247,157  ops/s
L3CacheBenchmark.testMethod      19  thrpt  100  212969,202 ± 18204,625  ops/s
L3CacheBenchmark.testMethod      20  thrpt  100  295246,470 ±  1224,817  ops/s
L3CacheBenchmark.testMethod      21  thrpt  100  251762,642 ± 23405,100  ops/s
L3CacheBenchmark.testMethod      22  thrpt  100  323196,428 ±  2245,465  ops/s
L3CacheBenchmark.testMethod      23  thrpt  100  254588,338 ± 23845,090  ops/s
L3CacheBenchmark.testMethod      24  thrpt  100   53373,580 ±   252,183  ops/s
L3CacheBenchmark.testMethod      25  thrpt  100  213220,459 ± 20440,716  ops/s
L3CacheBenchmark.testMethod      26  thrpt  100  322625,597 ±  2076,341  ops/s
L3CacheBenchmark.testMethod      27  thrpt  100  293643,720 ±  5260,010  ops/s
L3CacheBenchmark.testMethod      28  thrpt  100  297432,240 ±  1186,920  ops/s
L3CacheBenchmark.testMethod      29  thrpt  100  169277,701 ± 25040,239  ops/s
L3CacheBenchmark.testMethod      30  thrpt  100  324230,899 ±  1579,103  ops/s
L3CacheBenchmark.testMethod      31  thrpt  100  193981,979 ± 12478,424  ops/s
L3CacheBenchmark.testMethod      32  thrpt  100   53761,030 ±   259,888  ops/s
L3CacheBenchmark.testMethod      33  thrpt  100  213585,493 ± 23543,671  ops/s
L3CacheBenchmark.testMethod      34  thrpt  100  325214,062 ±  1758,479  ops/s
L3CacheBenchmark.testMethod      35  thrpt  100  306652,634 ±  2237,818  ops/s
L3CacheBenchmark.testMethod      36  thrpt  100  297992,930 ±  1019,248  ops/s
L3CacheBenchmark.testMethod      37  thrpt  100  181671,812 ± 21984,441  ops/s
L3CacheBenchmark.testMethod      38  thrpt  100  321929,616 ±  1798,747  ops/s
L3CacheBenchmark.testMethod      39  thrpt  100  251587,385 ± 12292,670  ops/s
L3CacheBenchmark.testMethod      40  thrpt  100   49777,196 ±   227,620  ops/s

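As you can see, the throughput varies. The most notable difference is for arrays whose size is a multiple of 8 MB: they run roughly 4 to 5 times slower than the others. Also, for example, the 37 MB array is almost twice as slow as the 38 MB one. I have not found any logical explanation for these findings.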

P.S. The CPU is an i7-4700MQ with 6 MB of cache: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4700MQ%20Mobile%20processor.html

What causes this behavior?

Best answer

You are observing the effect of cache associativity.

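Your CPU has a 256 KB, 8-way set-associative L2 cache per core. With 64-byte cache lines it can hold at most 256 KB / 64 B = 4096 lines, organized as 512 sets, and no more than 8 of those lines can share the same index bits (that is, map to the same set).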

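Your benchmark loop writes to 1025 different addresses. Depending on the stride, however, these addresses may fall into only a few sets, which causes conflicts and evictions from the cache. That is exactly what happens when the stride (step) is 8191, 16383, 24575, and so on.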
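The set-conflict argument can be checked with a little arithmetic. The following is a minimal sketch, not the answerer's code: it assumes the array starts at address 0, that the set index is simply (address / 64) mod 512, and it ignores the Java array header and virtual-to-physical address mapping. The class and method names (`L2SetConflictSketch`, `maxLinesPerSet`) are hypothetical. It replays the benchmark's stride and reports the largest number of distinct cache lines that land in a single L2 set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class L2SetConflictSketch {
    static final int LINE_BYTES = 64;                           // cache line size
    static final int WAYS = 8;                                  // L2 associativity
    static final int SETS = 256 * 1024 / (LINE_BYTES * WAYS);   // 512 sets

    // Replays the benchmark's access pattern for an array of sizeMb megabytes
    // and returns the maximum number of distinct lines mapped to one set.
    // Assumption: array base address is 0 and the index bits are taken
    // directly from the offset (a simplification of real hardware).
    static int maxLinesPerSet(int sizeMb) {
        int len = sizeMb * 1024 * 1024;
        int step = (len / 1024) - 1;                            // same stride as the benchmark
        Map<Integer, Set<Integer>> linesBySet = new HashMap<>();
        for (int k = 0; k < len; k += step) {
            int line = k / LINE_BYTES;
            linesBySet.computeIfAbsent(line % SETS, s -> new HashSet<>()).add(line);
        }
        int max = 0;
        for (Set<Integer> lines : linesBySet.values()) {
            max = Math.max(max, lines.size());
        }
        return max;
    }

    public static void main(String[] args) {
        for (int sizeMb : new int[] {8, 9, 16}) {
            int max = maxLinesPerSet(sizeMb);
            System.out.println("size " + sizeMb + " MB: max lines in one set = " + max
                    + (max > WAYS ? " (exceeds " + WAYS + " ways)" : " (fits)"));
        }
    }
}
```

Under these assumptions, sizes 8 and 16 put more lines into a single set than the 8 available ways, while size 9 stays within 8, which is consistent with the slow and fast rows in the results table.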

To verify this theory, re-run the JMH benchmark with the -prof perfnorm option.
Here are the statistics for size = 8 and size = 9:

L3CacheBenchmark.testMethod:CPI                      8  thrpt      1.173  #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses    8  thrpt   1048.088  #/op
L3CacheBenchmark.testMethod:L1-dcache-loads          8  thrpt   1073.767  #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses   8  thrpt   1049.491  #/op
L3CacheBenchmark.testMethod:L1-dcache-stores         8  thrpt   1060.069  #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses    8  thrpt      1.209  #/op
L3CacheBenchmark.testMethod:LLC-load-misses          8  thrpt      0.082  #/op
L3CacheBenchmark.testMethod:LLC-loads                8  thrpt      1.399  #/op
L3CacheBenchmark.testMethod:LLC-store-misses         8  thrpt      0.077  #/op
L3CacheBenchmark.testMethod:LLC-stores               8  thrpt   1035.877  #/op
L3CacheBenchmark.testMethod:branch-misses            8  thrpt      1.234  #/op
L3CacheBenchmark.testMethod:branches                 8  thrpt   2096.674  #/op
L3CacheBenchmark.testMethod:cycles                   8  thrpt  13520.964  #/op
L3CacheBenchmark.testMethod:dTLB-load-misses         8  thrpt      0.057  #/op
L3CacheBenchmark.testMethod:dTLB-loads               8  thrpt   1086.355  #/op
L3CacheBenchmark.testMethod:dTLB-store-misses        8  thrpt      0.020  #/op
L3CacheBenchmark.testMethod:dTLB-stores              8  thrpt   1068.579  #/op
L3CacheBenchmark.testMethod:iTLB-load-misses         8  thrpt      0.044  #/op
L3CacheBenchmark.testMethod:iTLB-loads               8  thrpt      0.018  #/op
L3CacheBenchmark.testMethod:instructions             8  thrpt  11530.742  #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend   8  thrpt   8315.437  #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend  8  thrpt  10359.447  #/op

L3CacheBenchmark.testMethod:CPI                      9  thrpt      0.871  #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses    9  thrpt   1055.973  #/op
L3CacheBenchmark.testMethod:L1-dcache-loads          9  thrpt   1068.958  #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses   9  thrpt   1045.480  #/op
L3CacheBenchmark.testMethod:L1-dcache-stores         9  thrpt   1057.328  #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses    9  thrpt      1.108  #/op
L3CacheBenchmark.testMethod:LLC-load-misses          9  thrpt      0.174  #/op
L3CacheBenchmark.testMethod:LLC-loads                9  thrpt      0.304  #/op
L3CacheBenchmark.testMethod:LLC-store-misses         9  thrpt      0.045  #/op
L3CacheBenchmark.testMethod:LLC-stores               9  thrpt      0.350  #/op
L3CacheBenchmark.testMethod:branch-misses            9  thrpt      1.072  #/op
L3CacheBenchmark.testMethod:branches                 9  thrpt   2099.846  #/op
L3CacheBenchmark.testMethod:cycles                   9  thrpt  10041.724  #/op
L3CacheBenchmark.testMethod:dTLB-load-misses         9  thrpt      0.086  #/op
L3CacheBenchmark.testMethod:dTLB-loads               9  thrpt   1073.633  #/op
L3CacheBenchmark.testMethod:dTLB-store-misses        9  thrpt      0.045  #/op
L3CacheBenchmark.testMethod:dTLB-stores              9  thrpt   1054.587  #/op
L3CacheBenchmark.testMethod:iTLB-load-misses         9  thrpt      0.044  #/op
L3CacheBenchmark.testMethod:iTLB-loads               9  thrpt      0.037  #/op
L3CacheBenchmark.testMethod:instructions             9  thrpt  11529.996  #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend   9  thrpt   3439.278  #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend  9  thrpt   6888.714  #/op

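The most notable difference is in LLC-stores: about 1035 per operation for size 8 and almost none for size 9. This means that the stored data does not fit in the L2 cache and goes to L3.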

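By the way, your benchmark cannot measure the effect of the L3 cache, because it touches only a small amount of data (about 64 KB). For a fair test, you need to read and write the whole range of the allocated array.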
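One way to cover the whole array is to touch every cache line instead of a fixed 1024 elements. The following is a minimal sketch in plain Java, assuming 64-byte cache lines; the class and method names (`FullRangeTouchSketch`, `touchAllLines`) are hypothetical, and the loop body is only a suggestion for what a JMH `@Benchmark` method could run, not the answerer's code.

```java
public class FullRangeTouchSketch {
    static final int LINE_BYTES = 64; // assumed cache line size

    // Reads and writes one byte in every cache line of the array,
    // so the working set equals the full array size.
    // Returns the number of lines touched.
    static int touchAllLines(byte[] array) {
        int touched = 0;
        for (int k = 0; k < array.length; k += LINE_BYTES) {
            array[k]++;   // read-modify-write of one byte per line
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        byte[] array = new byte[8 * 1024 * 1024];
        System.out.println("lines touched: " + touchAllLines(array));
    }
}
```

Because the number of lines touched now grows with the array size, a benchmark built on this loop would need to report time per line (or per byte) rather than raw operations per second for the sizes to be comparable.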

A similar question on Stack Overflow: https://stackoverflow.com/questions/43161089/
