c++ - AVX2 support in GCC 5 and later

Reposted by: 太空狗 · Updated: 2023-10-29 20:02:16

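I wrote the class "T" below to speed up "character set" operations using AVX2. I then found that it does not work with gcc 5 and later when I use "-O3". Can anyone help me trace this to some programming construct that is known not to work on newer compilers/systems?

How the code works: the underlying structure ("_bits") is a 256-byte block (aligned and allocated for AVX2) that can be accessed either as char[256] or as AVX2 elements, depending on whether a single element is accessed or the whole block is used in a vector operation. It looks like it should work perfectly on AVX2 platforms. No?

This is really hard to debug, because "valgrind" says the code is clean and I cannot use a debugger (the problem disappears when I remove "-O3"). But I am not happy with just using the "|=" workaround, because if this code really is wrong, I am probably making the same mistake elsewhere and breaking everything I develop!

Interestingly, the "|" operator has the problem but "|=" does not. Could the problem be related to returning a structure from a function? But I thought returning structures has worked since 1990 or so.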

// g++ -std=c++11 -mavx2 -O3 gcc_fail.cpp

#include "assert.h"
#include "immintrin.h" // AVX

class T {
public:
  __m256i _bits[8];
  inline bool& operator[](unsigned char c)       {return ((bool*)_bits)[c];}
  inline bool  operator[](unsigned char c) const {return ((bool*)_bits)[c];}
  inline          T()                   {}
  inline explicit T(char const*);
  inline T     operator| (T const& b) const;
  inline T &   operator|=(T const& b);
  inline bool  operator! ()           const;
};

T::T(char const* s)
{
  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
  char c;
  while ((c = *s++))
    (*this)[c] = true;
}

T T::operator| (T const& b) const
{
  T res;
  for (int i = 0; i < 8; i++)
    res._bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);


  // FIXME why does the above code fail with -O3 in new gcc?
  for (int i=0; i<256; i++)
    assert(res[i] == ((*this)[i] || b[i]));
  // gcc 4.7.0 - PASS
  // gcc 4.7.2 - PASS
  // gcc 4.8.0 - PASS
  // gcc 4.9.2 - PASS
  // gcc 5.2.0 - FAIL
  // gcc 5.3.0 - FAIL
  // gcc 5.3.1 - FAIL
  // gcc 6.1.0 - FAIL


  return res;
}

T & T::operator|=(T const& b)
{
  for (int i = 0; i < 8; i++)
    _bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
  return *this;
}

bool T::operator! () const
{
  for (int i = 0; i < 8; i++)
    if (!_mm256_testz_si256(_bits[i], _bits[i]))
      return false;
  return true;
}

int Main()
{
  T sep (" ,\t\n");
  T end ("");
  return !(sep|end);
}

int main()
{
  return Main();
}

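Best answer

The problem with your code is that it uses bool* where it should use unsigned char*, which permits GCC 5 to go ahead with a pointer-aliasing optimization.

Two dumps of the machine code for the function Main(), generated by GCC 4.8.5 and 5.3.1, are in the appendix at the end of this answer for reference.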
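As a rough illustration of that fix, here is a minimal sketch that declares the storage as unsigned char, so per-element access needs no pointer cast. It is not the answer author's code: the names `CharSet`, `bytes`, `empty` and `self_check` are hypothetical, and the `__AVX2__` guard with a scalar fallback is an assumption added so the sketch also builds without -mavx2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical rewrite of the question's class T. The storage is an array of
// unsigned char, so element access involves no bool* cast.
class CharSet {
public:
  alignas(32) unsigned char bytes[256];

  CharSet() { std::memset(bytes, 0, sizeof bytes); }

  explicit CharSet(char const* s) {
    std::memset(bytes, 0, sizeof bytes);
    while (unsigned char c = static_cast<unsigned char>(*s++))
      bytes[c] = 1;
  }

  unsigned char& operator[](unsigned char c)       { return bytes[c]; }
  bool           operator[](unsigned char c) const { return bytes[c] != 0; }

  CharSet operator|(CharSet const& b) const {
    CharSet res;
#if defined(__AVX2__)
    // Move data between the byte storage and vector registers with the
    // unaligned load/store intrinsics, 32 bytes at a time.
    for (std::size_t i = 0; i < sizeof bytes; i += 32) {
      __m256i x = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(bytes + i));
      __m256i y = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(b.bytes + i));
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(res.bytes + i),
                          _mm256_or_si256(x, y));
    }
#else
    // Scalar fallback when AVX2 is not enabled at compile time.
    for (std::size_t i = 0; i < sizeof bytes; ++i)
      res.bytes[i] = static_cast<unsigned char>(bytes[i] | b.bytes[i]);
#endif
    return res;
  }

  // True when no character is in the set.
  bool empty() const {
    for (std::size_t i = 0; i < sizeof bytes; ++i)
      if (bytes[i] != 0) return false;
    return true;
  }
};

// Mirrors the question's Main(): builds two sets, ORs them, and repeats the
// per-element check from the question. Returns 1 if the union is empty.
inline int self_check() {
  CharSet sep(" ,\t\n");
  CharSet end("");
  CharSet res = sep | end;
  for (int i = 0; i < 256; i++) {
    unsigned char c = static_cast<unsigned char>(i);
    assert(res[c] == (sep[c] | end[c]));
  }
  return res.empty() ? 1 : 0;
}
```

The question's class could instead keep `__m256i _bits[8]` and only change the cast in `operator[]` to `unsigned char*`; the sketch uses byte storage simply to avoid the cast altogether.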

Looking at the code:

Disassembly

After the prologue, the _bits of T sep are initialized to zero...

  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);

  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)

The loop then writes to it according to char* s.

  char c;
  while ((c = *s++))
    (*this)[c] = true;

  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>

Both compilers then initialize T end to 0:

  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)

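Both compilers then optimize away the _mm256_or_si256() operations, because T end is known to be 0. But GCC 4.8.5 then copies T sep into T res (which is what OR-ing anything with a zero variable computes), whereas GCC 5.3.1 initializes T res to 0. It is entitled to do so because, in your operator[] methods, you cast a pointer of type __m256i* to bool*, and the compiler is allowed to assume that the two pointers do not alias. The stores made through bool& in the constructor are therefore assumed not to have changed _bits, so T sep is still treated as all zeros. Thus in GCC 4.8.5 you see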

  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)

and in GCC 5.3.1 you see

  4006fa:       c5 fd 7f 85 f0 fe ff ff         vmovdqa %ymm0,-0x110(%rbp)
  400702:       c5 fd 7f 85 10 ff ff ff         vmovdqa %ymm0,-0xf0(%rbp)
  40070a:       c5 fd 7f 85 30 ff ff ff         vmovdqa %ymm0,-0xd0(%rbp)
  400712:       c5 fd 7f 85 50 ff ff ff         vmovdqa %ymm0,-0xb0(%rbp)
  40071a:       c5 fd 7f 85 70 ff ff ff         vmovdqa %ymm0,-0x90(%rbp)
  400722:       c5 fd 7f 45 90                  vmovdqa %ymm0,-0x70(%rbp)
  400727:       c5 fd 7f 45 b0                  vmovdqa %ymm0,-0x50(%rbp)
  40072c:       c5 fd 7f 45 d0                  vmovdqa %ymm0,-0x30(%rbp)

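The reads in the assert() then fail.

What the standard says about pointer aliasing:

ISO C++11 addresses aliasing in the section below, which makes clear that an object of type __m256i cannot be accessed through a bool*, but can be accessed through a char*/unsigned char*: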
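The same assumption can be seen in a much smaller scalar setting. The sketch below is a hypothetical analogue, not taken from the question: `store_both` and `store_both_demo` are made-up names. Because int and float are unrelated types, an optimizer may assume the two pointers refer to different objects and return the constant 1 without reloading `*ip`. The demo only calls it with distinct objects, which is well defined; passing two pointers to the same storage would be the same kind of undefined behavior as in the question.

```cpp
#include <cassert>

// Hypothetical scalar analogue of the question's bug. 'ip' and 'fp' point to
// unrelated types, so the compiler may assume they never refer to the same
// object.
int store_both(int* ip, float* fp) {
  *ip = 1;
  *fp = 0.0f;  // assumed not to modify *ip
  return *ip;  // may be folded to the constant 1
}

// Well-defined use: the two pointers really do refer to different objects.
inline void store_both_demo() {
  int i = 5;
  float f = 3.0f;
  assert(store_both(&i, &f) == 1);
  assert(i == 1);
  assert(f == 0.0f);
}
```

In the question, the bool stores play the role of `*fp` and the __m256i loads play the role of `*ip`.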

§ 3.10 Lvalues and rvalues [basic.lval]

[...]

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [52]

the dynamic type of the object,

a cv-qualified version of the dynamic type of the object,

a type similar (as defined in 4.4) to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,

an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),

a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,

a char or unsigned char type.

52) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
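The last item in that list is what makes byte-wise access legal. As a small sketch of it, the hypothetical helper below (`count_nonzero_bytes` is not part of the question or the answer) walks the bytes of an arbitrary object through an unsigned char pointer. The same kind of access would apply to the question's _bits array if operator[] used unsigned char instead of bool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: counts the non-zero bytes in the storage of any
// object. Reading through unsigned char is permitted for every object type
// by the aliasing list quoted above.
template <typename Obj>
std::size_t count_nonzero_bytes(Obj const& obj) {
  unsigned char const* p = reinterpret_cast<unsigned char const*>(&obj);
  std::size_t n = 0;
  for (std::size_t i = 0; i < sizeof(Obj); ++i)
    if (p[i] != 0) ++n;
  return n;
}

// The checked values do not depend on byte order.
inline void byte_view_demo() {
  std::uint32_t v = 0x00ff00ffu;
  assert(count_nonzero_bytes(v) == 2);
  std::uint64_t z = 0;
  assert(count_nonzero_bytes(z) == 0);
}
```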

Appendix

GCC 4.8.5:

0000000000400620 <_Z4Mainv>:
  400620:       55                              push   %rbp
  400621:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400625:       ba e5 08 40 00                  mov    $0x4008e5,%edx
  40062a:       b8 20 00 00 00                  mov    $0x20,%eax
  40062f:       48 89 e5                        mov    %rsp,%rbp
  400632:       48 83 e4 e0                     and    $0xffffffffffffffe0,%rsp
  400636:       48 81 ec 00 03 00 00            sub    $0x300,%rsp
  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)
  400678:       0f 1f 84 00 00 00 00 00         nopl   0x0(%rax,%rax,1)
  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>
  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)
  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)
  400761:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)
  400768:       80 3c 04 00                     cmpb   $0x0,(%rsp,%rax,1)
  40076c:       0f b6 8c 04 00 02 00 00         movzbl 0x200(%rsp,%rax,1),%ecx
  400774:       ba 01 00 00 00                  mov    $0x1,%edx
  400779:       75 08                           jne    400783 <_Z4Mainv+0x163>
  40077b:       0f b6 94 04 00 01 00 00         movzbl 0x100(%rsp,%rax,1),%edx
  400783:       38 d1                           cmp    %dl,%cl
  400785:       0f 85 b2 00 00 00               jne    40083d <_Z4Mainv+0x21d>
  40078b:       48 83 c0 01                     add    $0x1,%rax
  40078f:       48 3d 00 01 00 00               cmp    $0x100,%rax
  400795:       75 d1                           jne    400768 <_Z4Mainv+0x148>
  400797:       c5 fd 6f 8c 24 00 02 00 00      vmovdqa 0x200(%rsp),%ymm1
  4007a0:       31 c0                           xor    %eax,%eax
  4007a2:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007a7:       0f 94 c0                        sete   %al
  4007aa:       0f 85 88 00 00 00               jne    400838 <_Z4Mainv+0x218>
  4007b0:       c5 fd 6f 8c 24 20 02 00 00      vmovdqa 0x220(%rsp),%ymm1
  4007b9:       31 c0                           xor    %eax,%eax
  4007bb:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007c0:       0f 94 c0                        sete   %al
  4007c3:       75 73                           jne    400838 <_Z4Mainv+0x218>
  4007c5:       c5 fd 6f 8c 24 40 02 00 00      vmovdqa 0x240(%rsp),%ymm1
  4007ce:       31 c0                           xor    %eax,%eax
  4007d0:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007d5:       0f 94 c0                        sete   %al
  4007d8:       75 5e                           jne    400838 <_Z4Mainv+0x218>
  4007da:       c5 fd 6f 8c 24 60 02 00 00      vmovdqa 0x260(%rsp),%ymm1
  4007e3:       31 c0                           xor    %eax,%eax
  4007e5:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ea:       0f 94 c0                        sete   %al
  4007ed:       75 49                           jne    400838 <_Z4Mainv+0x218>
  4007ef:       c5 fd 6f 8c 24 80 02 00 00      vmovdqa 0x280(%rsp),%ymm1
  4007f8:       31 c0                           xor    %eax,%eax
  4007fa:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ff:       0f 94 c0                        sete   %al
  400802:       75 34                           jne    400838 <_Z4Mainv+0x218>
  400804:       c5 fd 6f 8c 24 a0 02 00 00      vmovdqa 0x2a0(%rsp),%ymm1
  40080d:       31 c0                           xor    %eax,%eax
  40080f:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400814:       0f 94 c0                        sete   %al
  400817:       75 1f                           jne    400838 <_Z4Mainv+0x218>
  400819:       c5 fd 6f 8c 24 c0 02 00 00      vmovdqa 0x2c0(%rsp),%ymm1
  400822:       31 c0                           xor    %eax,%eax
  400824:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400829:       0f 94 c0                        sete   %al
  40082c:       75 0a                           jne    400838 <_Z4Mainv+0x218>
  40082e:       31 c0                           xor    %eax,%eax
  400830:       c4 e2 7d 17 c0                  vptest %ymm0,%ymm0
  400835:       0f 94 c0                        sete   %al
  400838:       c5 f8 77                        vzeroupper 
  40083b:       c9                              leaveq 
  40083c:       c3                              retq   
  40083d:       b9 20 09 40 00                  mov    $0x400920,%ecx
  400842:       ba 26 00 00 00                  mov    $0x26,%edx
  400847:       be e9 08 40 00                  mov    $0x4008e9,%esi
  40084c:       bf f8 08 40 00                  mov    $0x4008f8,%edi
  400851:       c5 f8 77                        vzeroupper 
  400854:       e8 97 fc ff ff                  callq  4004f0 <__assert_fail@plt>
  400859:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)

GCC 5:

0000000000400630 <_Z4Mainv>:
  400630:       4c 8d 54 24 08                  lea    0x8(%rsp),%r10
  400635:       48 83 e4 e0                     and    $0xffffffffffffffe0,%rsp
  400639:       b8 20 00 00 00                  mov    $0x20,%eax
  40063e:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400642:       ba 25 08 40 00                  mov    $0x400825,%edx
  400647:       41 ff 72 f8                     pushq  -0x8(%r10)
  40064b:       55                              push   %rbp
  40064c:       48 89 e5                        mov    %rsp,%rbp
  40064f:       41 52                           push   %r10
  400651:       48 81 ec 08 03 00 00            sub    $0x308,%rsp
  400658:       c5 fd 7f 85 50 fd ff ff         vmovdqa %ymm0,-0x2b0(%rbp)
  400660:       c5 fd 7f 85 30 fd ff ff         vmovdqa %ymm0,-0x2d0(%rbp)
  400668:       c5 fd 7f 85 10 fd ff ff         vmovdqa %ymm0,-0x2f0(%rbp)
  400670:       c5 fd 7f 85 f0 fc ff ff         vmovdqa %ymm0,-0x310(%rbp)
  400678:       c5 fd 7f 85 d0 fd ff ff         vmovdqa %ymm0,-0x230(%rbp)
  400680:       c5 fd 7f 85 b0 fd ff ff         vmovdqa %ymm0,-0x250(%rbp)
  400688:       c5 fd 7f 85 90 fd ff ff         vmovdqa %ymm0,-0x270(%rbp)
  400690:       c5 fd 7f 85 70 fd ff ff         vmovdqa %ymm0,-0x290(%rbp)
  400698:       0f 1f 84 00 00 00 00 00         nopl   0x0(%rax,%rax,1)
  4006a0:       48 83 c2 01                     add    $0x1,%rdx
  4006a4:       c6 84 05 f0 fc ff ff 01         movb   $0x1,-0x310(%rbp,%rax,1)
  4006ac:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  4006b0:       84 c0                           test   %al,%al
  4006b2:       75 ec                           jne    4006a0 <_Z4Mainv+0x70>
  4006b4:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  4006b8:       31 c0                           xor    %eax,%eax
  4006ba:       c5 fd 7f 85 50 fe ff ff         vmovdqa %ymm0,-0x1b0(%rbp)
  4006c2:       c5 fd 7f 85 30 fe ff ff         vmovdqa %ymm0,-0x1d0(%rbp)
  4006ca:       c5 fd 7f 85 10 fe ff ff         vmovdqa %ymm0,-0x1f0(%rbp)
  4006d2:       c5 fd 7f 85 f0 fd ff ff         vmovdqa %ymm0,-0x210(%rbp)
  4006da:       c5 fd 7f 85 d0 fe ff ff         vmovdqa %ymm0,-0x130(%rbp)
  4006e2:       c5 fd 7f 85 b0 fe ff ff         vmovdqa %ymm0,-0x150(%rbp)
  4006ea:       c5 fd 7f 85 90 fe ff ff         vmovdqa %ymm0,-0x170(%rbp)
  4006f2:       c5 fd 7f 85 70 fe ff ff         vmovdqa %ymm0,-0x190(%rbp)
  4006fa:       c5 fd 7f 85 f0 fe ff ff         vmovdqa %ymm0,-0x110(%rbp)
  400702:       c5 fd 7f 85 10 ff ff ff         vmovdqa %ymm0,-0xf0(%rbp)
  40070a:       c5 fd 7f 85 30 ff ff ff         vmovdqa %ymm0,-0xd0(%rbp)
  400712:       c5 fd 7f 85 50 ff ff ff         vmovdqa %ymm0,-0xb0(%rbp)
  40071a:       c5 fd 7f 85 70 ff ff ff         vmovdqa %ymm0,-0x90(%rbp)
  400722:       c5 fd 7f 45 90                  vmovdqa %ymm0,-0x70(%rbp)
  400727:       c5 fd 7f 45 b0                  vmovdqa %ymm0,-0x50(%rbp)
  40072c:       c5 fd 7f 45 d0                  vmovdqa %ymm0,-0x30(%rbp)
  400731:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)
  400738:       0f b6 94 05 f0 fc ff ff         movzbl -0x310(%rbp,%rax,1),%edx
  400740:       0f b6 8c 05 f0 fe ff ff         movzbl -0x110(%rbp,%rax,1),%ecx
  400748:       84 d2                           test   %dl,%dl
  40074a:       75 08                           jne    400754 <_Z4Mainv+0x124>
  40074c:       0f b6 94 05 f0 fd ff ff         movzbl -0x210(%rbp,%rax,1),%edx
  400754:       38 d1                           cmp    %dl,%cl
  400756:       75 2c                           jne    400784 <_Z4Mainv+0x154>
  400758:       48 83 c0 01                     add    $0x1,%rax
  40075c:       48 3d 00 01 00 00               cmp    $0x100,%rax
  400762:       75 d4                           jne    400738 <_Z4Mainv+0x108>
  400764:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400768:       31 c0                           xor    %eax,%eax
  40076a:       c4 e2 7d 17 c0                  vptest %ymm0,%ymm0
  40076f:       0f 94 c0                        sete   %al
  400772:       c5 f8 77                        vzeroupper 
  400775:       48 81 c4 08 03 00 00            add    $0x308,%rsp
  40077c:       41 5a                           pop    %r10
  40077e:       5d                              pop    %rbp
  40077f:       49 8d 62 f8                     lea    -0x8(%r10),%rsp
  400783:       c3                              retq   
  400784:       b9 60 08 40 00                  mov    $0x400860,%ecx
  400789:       ba 26 00 00 00                  mov    $0x26,%edx
  40078e:       be 29 08 40 00                  mov    $0x400829,%esi
  400793:       bf 38 08 40 00                  mov    $0x400838,%edi
  400798:       c5 f8 77                        vzeroupper 
  40079b:       e8 50 fd ff ff                  callq  4004f0 <__assert_fail@plt>

This article, "c++ - AVX2 support in GCC 5 and later", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/43152787/
