
performance - On average, how much slower is cmpxchg16b than its 64- or 32-bit counterparts on modern x64 CPUs?

Reposted. Author: 行者123. Updated: 2023-12-02 02:57:58

I believe Windows internals have used this instruction for a long time, so have CPU manufacturers put effort into optimizing it?

This assumes, of course, that the memory is properly aligned, that the cache line is not shared, and so on.

Best answer

Out of curiosity, I wrote a small benchmark comparing the cost of 4-byte and 8-byte cmpxchg with that of cmpxchg16b:

#include <cstdint>
#include <benchmark/benchmark.h>

// cmpxchg16b requires its memory operand to be 16-byte aligned.
alignas(16) char input[16 * 1024] = {};

template<class T>
void do_benchmark(benchmark::State& state) {
    unsigned n = 0;
    T* p = reinterpret_cast<T*>(input);
    constexpr unsigned count = sizeof input / sizeof(T);
    unsigned i = 0;
    for(auto _ : state) {
        T v{0};
        // The buffer is all zeros, so every CAS succeeds and writes the same value back.
        n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
    }
    // Keep the accumulated result alive so the loop is not optimized away.
    benchmark::DoNotOptimize(n);
}

BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
BENCHMARK_TEMPLATE(do_benchmark, __int128);
BENCHMARK_MAIN();

I ran it on a Coffee Lake i9-9900KS CPU.

Results with gcc-8.3.0:

$ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/gcc/cmpxchg16b_benchmark
2020-03-15 20:18:48
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.43, 0.40, 0.34
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
do_benchmark<__int128>           6.35 ns         6.35 ns    110215116

Results with clang-8.0.0:

$ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/clang/cmpxchg16b_benchmark
2020-03-15 20:19:00
Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.36, 0.39, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
do_benchmark<__int128>           5.99 ns         5.99 ns    116972653

It looks like cmpxchg16b is about 1.6-1.8x more expensive than the 8-byte cmpxchg on Intel Coffee Lake.

The same benchmark on a Ryzen 9 5950X with gcc-9.3.0:

Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (32 X 4889.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.11, 0.52, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
do_benchmark<__int128>           2.22 ns         2.22 ns    316143309

On the AMD Ryzen 9, cmpxchg16b is about 1.4x more expensive than the 8-byte cmpxchg.

The original question, "performance - On average, how much slower is cmpxchg16b than its 64- or 32-bit counterparts on modern x64 CPUs?", can be found on Stack Overflow: https://stackoverflow.com/questions/60693096/
