c++ - AVX2 simd 在更高优化级别的标量性能相对较差-6ren

c++ - AVX2 simd 在更高优化级别的标量性能相对较差

转载作者：行者123 更新时间：2023-12-01 15:12:25

我正在学习和玩 SIMD 函数并编写了一个简单的程序，它比较了它可以在 1 秒内运行的 vector 加法指令的数量与正常的标量加法相比。 我发现 SIMD 在较低的优化级别上表现相对较好，而在较高的优化级别下表现更差，我想知道原因 我同时使用了 MSVC 和 gcc，这是同一个故事。以下结果来自 Ryzen 7 CPU。我还在英特尔平台上进行了测试，情况也差不多。

#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <thread>
#include <atomic>
#include <vector>
#include <immintrin.h>
int main()
{
    const auto threadLimit = std::thread::hardware_concurrency() - 1; //for running main() 
    for (auto i = 1; i <= threadLimit; ++i)
    {
        std::cerr << "Testing " << i << " threads: ";
        std::atomic<unsigned long long> sumScalar {};
        std::atomic<unsigned long long> loopScalar {};
        std::atomic<unsigned long long> sumSimd {};
        std::atomic<unsigned long long> loopSimd {};
        std::atomic_bool stopFlag{ false };
        std::vector<std::thread> threads;
        threads.reserve(i);
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                    {
                        uint32_t local{};
                        uint32_t loop{};
                        while (!stopFlag)
                        {
                            ++local;
                            ++loop;  //removed this(see EDIT)
                        }
                        sumScalar += local;
                        loopScalar += loop;
                    });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        threads.clear();
        stopFlag = false;
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                    {
                        const auto oneVec = _mm256_set1_epi32(1);
                        auto local = _mm256_set1_epi32(0);
                        uint32_t inc{};
                        while (!stopFlag)
                        {
                            local = _mm256_add_epi32(oneVec, local);
                            ++inc; //removed this(see EDIT)
                        }
                        sumSimd += std::accumulate(reinterpret_cast<uint32_t*>(&local), reinterpret_cast<uint32_t*>(&local) + 8, uint64_t{});
                        loopSimd += inc;
                    });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        std::cout << "Sum: "<<sumSimd <<" / "<<sumScalar <<"("<<100.0*sumSimd/sumScalar<<"%)\t"<<"Loop: "<<loopSimd<<" / "<<loopScalar<<"("<< 100.0*loopSimd/loopScalar<<"%)\n";
    // SIMD/Scalar, higher value means SIMD better
    }
}

使用 g++ -O0 -march=native -lpthread ，我得到:

Testing 1 threads: Sum: 1004405568 / 174344207(576.105%)        Loop: 125550696 / 174344207(72.0131%)
Testing 2 threads: Sum: 2001473960 / 348079929(575.004%)        Loop: 250184245 / 348079929(71.8755%)
Testing 3 threads: Sum: 2991335152 / 521830834(573.238%)        Loop: 373916894 / 521830834(71.6548%)
Testing 4 threads: Sum: 3892119680 / 693704725(561.063%)        Loop: 486514960 / 693704725(70.1329%)
Testing 5 threads: Sum: 4957263080 / 802362140(617.834%)        Loop: 619657885 / 802362140(77.2292%)
Testing 6 threads: Sum: 5417700112 / 953587414(568.139%)        Loop: 677212514 / 953587414(71.0174%)
Testing 7 threads: Sum: 6078496824 / 1067533241(569.396%)       Loop: 759812103 / 1067533241(71.1746%)
Testing 8 threads: Sum: 6679841000 / 1196224828(558.41%)        Loop: 834980125 / 1196224828(69.8013%)
Testing 9 threads: Sum: 7396623960 / 1308004474(565.489%)       Loop: 924577995 / 1308004474(70.6861%)
Testing 10 threads: Sum: 8158849904 / 1416026963(576.179%)      Loop: 1019856238 / 1416026963(72.0224%)
Testing 11 threads: Sum: 8868695984 / 1556964234(569.615%)      Loop: 1108586998 / 1556964234(71.2018%)
Testing 12 threads: Sum: 9441092968 / 1655554694(570.268%)      Loop: 1180136621 / 1655554694(71.2835%)
Testing 13 threads: Sum: 9530295080 / 1689916907(563.951%)      Loop: 1191286885 / 1689916907(70.4938%)
Testing 14 threads: Sum: 10444142536 / 1805583762(578.436%)     Loop: 1305517817 / 1805583762(72.3045%)
Testing 15 threads: Sum: 10834255144 / 1926575218(562.358%)     Loop: 1354281893 / 1926575218(70.2948%)

使用 g++ -O3 -march=native -lpthread ，我得到:

Testing 1 threads: Sum: 2933270968 / 3112671000(94.2365%)       Loop: 366658871 / 3112671000(11.7796%)
Testing 2 threads: Sum: 5839842040 / 6177278029(94.5375%)       Loop: 729980255 / 6177278029(11.8172%)
Testing 3 threads: Sum: 8775103584 / 9219587924(95.1789%)       Loop: 1096887948 / 9219587924(11.8974%)
Testing 4 threads: Sum: 11350253944 / 10210948580(111.158%)     Loop: 1418781743 / 10210948580(13.8947%)
Testing 5 threads: Sum: 14487451488 / 14623220822(99.0715%)     Loop: 1810931436 / 14623220822(12.3839%)
Testing 6 threads: Sum: 17141556576 / 14437058094(118.733%)     Loop: 2142694572 / 14437058094(14.8416%)
Testing 7 threads: Sum: 19883362288 / 18313186637(108.574%)     Loop: 2485420286 / 18313186637(13.5718%)
Testing 8 threads: Sum: 22574437968 / 17115166001(131.897%)     Loop: 2821804746 / 17115166001(16.4872%)
Testing 9 threads: Sum: 25356792368 / 18332200070(138.318%)     Loop: 3169599046 / 18332200070(17.2898%)
Testing 10 threads: Sum: 28079398984 / 20747150935(135.341%)    Loop: 3509924873 / 20747150935(16.9176%)
Testing 11 threads: Sum: 30783433560 / 21801526415(141.199%)    Loop: 3847929195 / 21801526415(17.6498%)
Testing 12 threads: Sum: 33420443880 / 22794998080(146.613%)    Loop: 4177555485 / 22794998080(18.3266%)
Testing 13 threads: Sum: 35989535640 / 23596768252(152.519%)    Loop: 4498691955 / 23596768252(19.0649%)
Testing 14 threads: Sum: 38647578408 / 23796083111(162.412%)    Loop: 4830947301 / 23796083111(20.3014%)
Testing 15 threads: Sum: 41148330392 / 24252804239(169.664%)    Loop: 5143541299 / 24252804239(21.208%)

编辑:删除 loop 变量后，在两种情况下都只留下 local(参见代码中的编辑)，结果仍然相同。
EDIT2:上面的结果是在 Ubuntu 上使用 GCC 9.3。我在 Windows (mingw) 上切换到 GCC 10.2， ，它显示了很好的缩放，见下文(结果是原始代码)。几乎可以断定是 MSVC 和 GCC 旧版本的问题？

Testing 1 threads: Sum: 23752640416 / 3153263747(753.272%)      Loop: 2969080052 / 3153263747(94.159%)
Testing 2 threads: Sum: 46533874656 / 6012052456(774.01%)       Loop: 5816734332 / 6012052456(96.7512%)
Testing 3 threads: Sum: 66076900784 / 9260324764(713.548%)      Loop: 8259612598 / 9260324764(89.1936%)
Testing 4 threads: Sum: 92216030528 / 12229625883(754.038%)     Loop: 11527003816 / 12229625883(94.2548%)
Testing 5 threads: Sum: 111822357864 / 14439219677(774.435%)    Loop: 13977794733 / 14439219677(96.8044%)
Testing 6 threads: Sum: 122858189272 / 17693796489(694.357%)    Loop: 15357273659 / 17693796489(86.7947%)
Testing 7 threads: Sum: 148478021656 / 19618236169(756.837%)    Loop: 18559752707 / 19618236169(94.6046%)
Testing 8 threads: Sum: 156931719736 / 19770409566(793.771%)    Loop: 19616464967 / 19770409566(99.2213%)
Testing 9 threads: Sum: 143331726552 / 20753115024(690.652%)    Loop: 17916465819 / 20753115024(86.3315%)
Testing 10 threads: Sum: 143541178880 / 20331801415(705.993%)   Loop: 17942647360 / 20331801415(88.2492%)
Testing 11 threads: Sum: 160425817888 / 22209102603(722.343%)   Loop: 20053227236 / 22209102603(90.2928%)
Testing 12 threads: Sum: 157095281392 / 23178532051(677.762%)   Loop: 19636910174 / 23178532051(84.7202%)
Testing 13 threads: Sum: 156015224880 / 23818567634(655.015%)   Loop: 19501903110 / 23818567634(81.8769%)
Testing 14 threads: Sum: 145464754912 / 23950304389(607.361%)   Loop: 18183094364 / 23950304389(75.9201%)
Testing 15 threads: Sum: 149279587872 / 23585183977(632.938%)   Loop: 18659948484 / 23585183977(79.1172%)

最佳答案

reinterpret_cast<uint32_t*>(&local)在循环让 GCC9 存储/重新加载 local 之后在循环内部，造成存储转发瓶颈 .
这已经在 GCC10 中修复；无需提交错过的优化错误。 不要将指针转换到 __m256i本地人；它也违反了严格混叠，所以 it's Undefined Behaviour没有 -fno-strict-aliasing即使 GCC 经常使它工作。 (You can point __m256i* at any other type, but not vice versa。)
gcc9.3(您正在使用)正在循环内存储/重新加载您的 vector ，但将标量保存在 inc eax 的寄存器中!
因此， vector 循环成为 vector 存储转发延迟加上 vpaddd 的瓶颈。，而这恰好比标量循环慢 8 倍多一点。它们的瓶颈是无关的，接近 1 倍的总速度只是巧合。
(标量循环大概在 Zen1 或 Skylake 上每次迭代运行 1 个周期，而 7 个周期的存储转发加上 1 的 vpaddd 听起来差不多)。

它是由reinterpret_cast<uint32_t*>(&local) 间接引起的 , 要么是因为 GCC 试图原谅严格混叠未定义行为的违规行为，要么只是因为您完全采用了指向本地的指针。
这是不正常的或预期的，但是内部循环内的原子负载和可能的 lambda 的组合使 GCC9 犯了这个错误。 (请注意，GCC9 和 10 正在从循环内的线程函数 arg 重新加载 stopFlag 的地址，即使对于标量也是如此，因此将内容保存在寄存器中已经有些失败。)
在正常用例中，每次检查停止标志时，您将做更多的 SIMD 工作，并且通常您不会在迭代中保持 vector 状态。通常你会有一个非原子参数告诉你要做多少工作，而不是你在内部循环中检查的停止标志。所以这个错过选择的错误很少成为问题。 (除非即使没有原子标志也会发生这种情况？)

可重现 on Godbolt , 显示 -DUB_TYPEPUN与 -UUB_TYPEPUN对于我使用 #ifdef 的源从 Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2 使用您的不安全(和错过选择触发)版本与带有手动矢量化随机播放的安全版本. (该手动 hsum 在添加之前不会变宽，因此它可能会溢出和换行。但这不是重点；使用不同的手动 shuffle 或 _mm256_store_si256 到单独的数组，可以在没有严格混叠的情况下获得您想要的结果未定义的行为。)
标量循环是:

# g++9.3 -O3 -march=znver1
.L5:                                      # do{
        inc     eax                         # local++
.L3:
        mov     rdx, QWORD PTR [rdi+8]      # load the address of stopFlag from the lambda
        movzx   edx, BYTE PTR [rdx]         # zero-extend *&stopFlag into EDX
        test    dl, dl
        je      .L5                       # }while(stopFlag == 0)

vector 循环，使用 g++ 9.3， -O3 -march=znver1 , 使用您的 reinterpret_cast (即 -DUB_TYPEPUN 在我的源代码版本中):

# g++9.3 -O3 -march=znver1  with your pointer-cast onto the vector

 # ... ymm1 = _mm256_set1_epi32(1)
.L10:                                               # do {
        vpaddd  ymm1, ymm0, YMMWORD PTR [rsp-32]       # memory-source add with set1(1)
        vmovdqa YMMWORD PTR [rsp-32], ymm1             # store back into stack memory
.L8:
        mov     rax, QWORD PTR [rdi+8]                  # load flag address
        movzx   eax, BYTE PTR [rax]                     # load stopFlag
        test    al, al
        je      .L10                                # }while(stopFlag == 0)

... auto-vectorized hsum, zero-extending elements to 64-bit for vpaddq

但有保险箱 __m256i避免指向 local 的指针的水平总和完全没有， local留在寄存器中。

#      ymm1 = _mm256_set1_epi32(1)
.L9:
        vpaddd  ymm0, ymm1, ymm0             # local += set1(1),  staying in a register, ymm0
.L8:
        mov     rax, QWORD PTR [rdi+8]       # same loop overhead, still 3 uops (with fusion of test/je)
        movzx   eax, BYTE PTR [rax]
        test    al, al
        je      .L9

... manually-vectorized 32-bit hsum

在我的 Intel Skylake i7-6700k 上，每个线程数我得到预期的 800 +- 1%，g++ 10.1 -O3 -march=skylake，Arch GNU/Linux，energy_performance_preference=balance_power(最大时钟 = 3.9GHz，任意# 核心活跃)。
标量和 vector 循环具有相同数量的微指令并且没有不同的瓶颈，因此它们以相同的周期/迭代运行。 (4，如果它可以保持这些地址 -> 停止标志负载的值(value)链，则可能在每个周期运行 1 次迭代)。
Zen1 可能不同，因为 vpaddd ymm是 2 微秒。但它的前端足够宽，可能仍以每次迭代 1 个周期运行该循环，因此您也可能在那里看到 800%。
与 ++loop未注释，我得到〜267％的“SIMD速度”。在 SIMD 循环中增加一个额外的 inc，它变成了 5 uop，并且可能会在 Skylake 上受到一些令人讨厌的前端影响。
-O0基准测试通常没有意义，它有不同的瓶颈(通常存储/重新加载将所有内容保存在内存中)，并且 SIMD 内在函数通常在 -O0 处有很多额外的开销.虽然在这种情况下，即使 -O3 SIMD 循环的存储/重新加载遇到瓶颈。

关于c++ - AVX2 simd 在更高优化级别的标量性能相对较差，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/63360909/

文章推荐： c++ - For 循环条件(整数 vector )。如何获取之前的值？

文章推荐： c++ - C++中的 union (请解释一下)

文章推荐： c++ - 为什么我的 weak_ptr 出现段错误

文章推荐： docker - 如何使用 Jenkins 在远程 docker 主机中运行容器

07、Perl 标量
Perl 中的标量是一个简单的数据单元标量的值可以是一个整数，浮点数，字符，字符串，段落或者一个完整的网页范例： Perl 中标量的使用 #!/usr/bin/perl =pod
arrays - 如何访问数据框列中的数组元素(标量)
This question already has answers here: Querying Spark SQL DataFrame with complex types (3个答案) 2年前关闭
r - 将数据帧的几列乘以一个因子(标量)
我有一个非常基本的问题，找不到解决方案，因此对于初学者的问题，请提前抱歉。我有一个包含多个 ID 列和 30 个数字列的数据框。我想用相同的因子乘以这 30 列的所有值。我想保持数据框的其余部分不变
graphql - 覆盖标准 ID 标量
我想使用 UUID 作为标识符，但标准标量 ID 被强制转换为字符串。所以在我使用 ID 类型的任何地方都必须从字符串中解析 uuid。我想知道是否可以用我自己的实现覆盖 ID 类型？这个标量类型有
python - 将(标量)函数数组转换为返回数组的函数
我有一个函数数组farr，比如说 import numpy as np farr=np.array([(lambda x, y: x+y) for n in range(5)]) (实际上，函数都是不
perl - 标量 vs 列表赋值运算符
请帮助我理解以下片段: my $count = @array; my @copy = @array; my ($first) = @array; (my $copy = $str) =~ s/\\/\
python - 数组+标量？ C
我有一个程序，我一直在玩弄，我偶然发现了这样的东西: unsigned char tmp[4]; ... if (mpu_write_mem(D_1_36, 2, tmp+2)) return
python - 将数组转换为 python 标量
我需要很大的帮助，请查看这段代码: import.math dose =20.0 a = [[[2,3,4],[5,8,9],[12,56,32]] [[25,36,45][21,65,98
c++ - 标量、 vector 、张量的抽象基类，
我要设计一个类PrimitiveType它作为标量、 vector 、张量等数学实体的抽象类，将它们存储在 std::vector myVector 中。我可以通过它进行迭代。例如，有两个相同大小的
c++ - 原始(标量)类型的差异复制初始化和直接初始化
这个问题在这里已经有了答案: int a = 0 and int a(0) differences [duplicate] (7 个答案) 关闭 3 年前。据我所知在C++中是一个初始化的形式 T
xml - 无法读取子内部的公共(public)标量
perl 代码如下:问题是我无法读取 sub tweak_server{} 中的 $key .... my $key; my %hash = ( flintstones => [ "C:/Users1
symfony - 带有双 % 的 YAML 标量
我正在尝试使用 symfony3 连接到数据库，但问题是当我将密码放入parameters.yml 中时，出现此错误: 数据库密码:xx%xxxxx%x You have requested a no
python - 有没有办法 pd.cut 标量？
我正在寻找 pd.cut 的等价物，但要寻找标量？我想这样做: bins = [0, 5, 10, 15, 20, 25, 30, 40, 50, 100, 150] pd.cut(43, bins
unit-testing - 如何更改单元测试模块中的 Perl Readonly 标量？
到目前为止，我在互联网上找到的唯一帮助是 this blog .我认为这会让我到达那里，但我认为它实际上并没有改变我模块中的值。我做了一个示例来说明我的意思。 package Module; use
while 循环中的 bool (标量)上下文中的 Perl 列表
我盯着 perl LWP::Protocol.pm 中的这段代码，我不明白循环将如何退出: while ($content = &$collector, length $$content) {
assembly - Power7 架构上的混合 assembly 标量/矢量
两年来，我正在开发一个库:cyme通过“友好容器”执行 SIMD 计算。我能够达到处理器的最大性能。通常用户定义容器并根据以下语法编写内核(简单示例): for(i...) W[i] = R[i]
opencl - 设置(标量)内核参数 OpenCL 后值错误
我正在开发一个 OpenCL 程序，但每次执行的输出都不同。我认为这与将参数传递给内核有关，因为当我对特定执行的值进行硬编码时，每次执行后的输出都是相似的。我的内核看起来像这样: __kernel
java - 如何在 SPQR 中使用 JSON 标量
我想在服务类中返回 JSON 文字 @GraphQLQuery(name = "renderUI", description = "Schema for your form") public Stri
perl - 将 PDL 标量转换为 Perl 标量
我有一个使用 PDL 的函数.最后一步是点积，因此它返回一个标量。但是，当我尝试打印这个标量时，它显然仍然是一个小玩意，并在屏幕上打印如下: [ [ 3 ] ] 我想知道如何将它转换回常规的 Pe
python - Pandas 标量 UDF 失败，IllegalArgumentException
首先，如果我的问题很简单，我深表歉意。我确实花了很多时间研究它。我正在尝试在 PySpark 脚本中设置标量 Pandas UDF，如所述 here . 这是我的代码: from pyspark i

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c++ - AVX2 simd 在更高优化级别的标量性能相对较差