c++ - Making g++ use SHLD/SHRD instructions

Reposted by: 塔克拉玛干. Updated: 2023-11-02 23:30:16

Consider the following code:

#include <limits>
#include <cstdint>

using T = uint32_t; // or uint64_t

T shift(T x, T y, T n)
{
    return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
}
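
A side note on the function above: when n is 0, the expression y << (digits - n) shifts by the full width of T, which is undefined behavior in C++. A minimal sketch of a variant that stays well defined for every n follows. The name shift_checked is hypothetical and does not appear in the original post, and no claim is made here about which instructions a given compiler emits for it.

```cpp
#include <cstdint>
#include <limits>

using T = std::uint32_t; // or std::uint64_t, as in the original snippet

// Hypothetical helper (not from the original post): same result as shift()
// for 0 < n < digits, but also well defined for n == 0 and n >= digits.
constexpr T shift_checked(T x, T y, T n)
{
    constexpr T bits = std::numeric_limits<T>::digits; // 32 for uint32_t
    n %= bits;            // keep the count in [0, bits), similar to SHRD's count masking
    if (n == 0) {
        return x;         // avoids y << bits, which would be undefined behavior
    }
    return (x >> n) | (y << (bits - n));
}

// The low 8 bits of y become the high 8 bits of the result.
static_assert(shift_checked(0x12345678u, 0x9ABCDEF0u, 8) == 0xF0123456u,
              "unexpected double-shift result");
```

For n = 8 the call combines 0x12345678 >> 8 (0x00123456) with 0x9ABCDEF0 << 24 (0xF0000000), giving 0xF0123456.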

According to godbolt, clang 3.8.1 generates the following assembly at -O1, -O2, and -O3:

shift(unsigned int, unsigned int, unsigned int):
        movb    %dl, %cl
        shrdl   %cl, %esi, %edi
        movl    %edi, %eax
        retq

whereas gcc 6.2 (even with -mtune=haswell) generates:

shift(unsigned int, unsigned int, unsigned int):
    movl    $32, %ecx
    subl    %edx, %ecx
    sall    %cl, %esi
    movl    %edx, %ecx
    shrl    %cl, %edi
    movl    %esi, %eax
    orl     %edi, %eax
    ret

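This looks far less optimized, since SHRD is very fast on Intel Sandybridge and later. Is there any way to rewrite the function so that compilers (gcc in particular) optimize it better and favor the SHLD/SHRD assembly instructions?

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

With -march=haswell, it emits BMI2 shlx/shrx, but still not shrd.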

Best answer

No, I can see no way to get gcc to use the SHRD instruction.
You can influence the output gcc generates by changing the -mtune and -march options.

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

Yes, you can get gcc to generate BMI2 code:

For example, x86-64 GCC 6.2 with -O3 -march=znver1 (AMD Zen)
generates the following (Haswell timings):

    code            critical path latency     reciprocal throughput
    ---------------------------------------------------------------
    mov     eax, 32          *                     0.25
    sub     eax, edx         1                     0.25        
    shlx    eax, esi, eax    1                     0.5
    shrx    esi, edi, edx    *                     0.5
    or      eax, esi         1                     0.25
    ret
    TOTAL:                   3                     1.75

Compared with clang 3.8.1:

    mov    cl, dl            1                     0.25
    shrd   edi, esi, cl      4                     2
    mov    eax, edi          *                     0.25 
    ret
    TOTAL:                   5                     2.5

Considering the dependency chain here, SHRD is slower on Haswell, slower on Sandybridge, and slower on Skylake.
The reciprocal throughput of the shrx sequence is also better.

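So it depends: on post-BMI processors gcc produces better code, while on pre-BMI processors clang wins.
SHRD timings vary widely across processors, so I can understand why gcc is not overly fond of it.
Even with -Os (optimize for size), gcc still does not pick SHRD.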

*) Not counted in the timing, because the instruction is either off the critical path or turns into a zero-latency register rename.
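
The answer finds no option or source rewrite that makes gcc choose SHRD on its own. If the instruction is wanted regardless, one possible workaround, which the original answer does not propose, is GCC/Clang extended inline assembly. The sketch below assumes a GCC-compatible compiler targeting x86 and falls back to plain C++ elsewhere; the helper name shrd32 is hypothetical. Inline asm is opaque to the optimizer (no constant folding, no instruction selection around it), and given the timings above it is not necessarily faster than what gcc already emits.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original post): forces a SHRD on x86
// when built with GCC or Clang, and uses portable C++ otherwise.
inline std::uint32_t shrd32(std::uint32_t x, std::uint32_t y, std::uint32_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // AT&T syntax: shrdl %cl, src, dst
    // Shifts dst right by cl (masked to 5 bits by the CPU) and fills the
    // vacated high bits from the low bits of src.
    __asm__("shrdl %%cl, %[src], %[dst]"
            : [dst] "+r"(x)          // x is read and written
            : [src] "r"(y), "c"(n)   // the count must live in ecx so that cl holds it
            : "cc");                 // SHRD modifies the flags
    return x;
#else
    // Portable fallback with the same count masking and no undefined shift.
    n &= 31;
    return n ? (x >> n) | (y << (32 - n)) : x;
#endif
}
```

Under these assumptions, shrd32(0x12345678, 0x9ABCDEF0, 8) yields 0xF0123456, and a count of 0 returns x unchanged on both paths.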

Regarding "c++ - Making g++ use SHLD/SHRD instructions", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39281925/
