
c++ - SIMD XOR operation is not as effective as Integer XOR?

Reposted. Author: 可可西里. Updated: 2023-11-01 18:29:26

My task is to compute the XOR sum of the bytes in an array:

X = char1 XOR char2 XOR char3 ... charN;

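I am trying to parallelize it by XORing __m128 values instead. This should give a speed-up factor of 16, since one __m128 holds 16 bytes. To cross-check the algorithm I also use int, which should give a speed-up factor of 4. The test program is 100 lines long and I can't make it any shorter, but it is simple: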

#include "xmmintrin.h" // SSE intrinsics (__m128, _mm_xor_ps)
#include <ctime>

#include <iostream>
using namespace std;

#include <stdlib.h> // rand

const int NIter = 100;

const int N = 40000000; // array size in bytes. Has to be divisible by 16 (one __m128 covers 16 bytes).
unsigned char str[N] __attribute__ ((aligned(16)));

template< typename T >
T Sum(const T* data, const int N)
{
    T sum = 0;
    for ( int i = 0; i < N; ++i )
        sum = sum ^ data[i];
    return sum;
}

template<>
__m128 Sum(const __m128* data, const int N)
{
    __m128 sum = _mm_set_ps1(0);
    for ( int i = 0; i < N; ++i )
        sum = _mm_xor_ps(sum,data[i]);
    return sum;
}

int main() {

    // fill the array with random values
    for( int i = 0; i < N; i++ ) {
        str[i] = rand() % 256; // random value from 0 to 255 (the scaled-double form could reach 256)
    }

    /// -- CALCULATE --

    /// SCALAR

    unsigned char sumS = 0;
    std::clock_t c_start = std::clock();
    for( int ii = 0; ii < NIter; ii++ )
        sumS = Sum<unsigned char>( str, N );
    double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// SIMD

    unsigned char sumV = 0;

    const int m128CharLen = 4*4; // bytes per __m128
    const int NV = N/m128CharLen;

    c_start = std::clock();
    for( int ii = 0; ii < NIter; ii++ ) {
        __m128 sumVV = _mm_set_ps1(0);
        sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
        unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);

        // fold the 16 bytes of the vector result into one byte
        sumV = sumVS[0];
        for ( int iE = 1; iE < m128CharLen; ++iE )
            sumV ^= sumVS[iE];
    }
    double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// SCALAR INTEGER

    unsigned char sumI = 0;

    const int intCharLen = 4; // bytes per int
    const int NI = N/intCharLen;

    c_start = std::clock();
    for( int ii = 0; ii < NIter; ii++ ) {
        int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
        unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);

        // fold the 4 bytes of the int result into one byte
        sumI = sumIS[0];
        for ( int iE = 1; iE < intCharLen; ++iE )
            sumI ^= sumIS[iE];
    }
    double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// -- OUTPUT --

    cout << "Time scalar: " << tScal << " ms " << endl;
    cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
    cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;

    if(sumV == sumS && sumI == sumS )
        std::cout << "Results are the same." << std::endl;
    else
        std::cout << "ERROR! Results are not the same." << std::endl;

    return 0; // was 1; 0 signals success to the shell
}

Typical results:

[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.

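As you can see, the int version performs ideally, but the SIMD version loses 25% of the expected speed, and this is stable. I tried changing the array size, but that did not help.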

Also, if I switch to -O2, I lose 75% of the speed in the SIMD version:

[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.

Can anyone explain this to me?

Additional information:

  1. I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860

  2. I use -fno-tree-vectorize to prevent auto-vectorization. Without this flag, the expected speed-up at -O3 is 1, because the task is simple. This is what I get:

    [10:55:40]$ g++ test.cpp -O3; ./a.out
    Time scalar: 270 ms
    Time INT: 270 ms, speed up 1
    Time SIMD: 280 ms, speed up 0.964286
    Results are the same.

    But the -O2 results are still strange:

    [10:55:02]$ g++ test.cpp -O2; ./a.out
    Time scalar: 3540 ms
    Time INT: 990 ms, speed up 3.57576
    Time SIMD: 880 ms, speed up 4.02273
    Results are the same.
  3. When I change

    for ( int i = 0; i < N; i+=1 )
        sum = sum ^ data[i];

    to:

    for ( int i = 0; i < N; i+=8 )
        sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;

    I do see the scalar version get 2 times faster, but I see no improvement in the speed-up factors. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.
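The unrolled loop in item 3 reads eight elements per step and therefore assumes that N is a multiple of 8. As a rough sketch only, the same idea can be written with separate partial results and a loop for the leftover elements. The helper name `xor_reduce_4way` and the choice of four partial results are hypothetical and not part of the original post, and no timings were taken for it.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): XOR-reduce an array
// using four partial results that do not depend on each other inside the loop.
template <typename T>
T xor_reduce_4way(const T* data, std::size_t n)
{
    T s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;

    // Main loop: each partial result only depends on its own previous value.
    for (; i + 4 <= n; i += 4) {
        s0 = static_cast<T>(s0 ^ data[i]);
        s1 = static_cast<T>(s1 ^ data[i + 1]);
        s2 = static_cast<T>(s2 ^ data[i + 2]);
        s3 = static_cast<T>(s3 ^ data[i + 3]);
    }

    // Combine the partial results once, after the loop.
    T sum = static_cast<T>((s0 ^ s1) ^ (s2 ^ s3));

    // Tail: up to three elements that did not fill a group of four.
    for (; i < n; ++i)
        sum = static_cast<T>(sum ^ data[i]);

    return sum;
}
```

It is a drop-in for the generic `Sum<T>` in the test program for integer types such as `unsigned char` and `int`; because XOR is associative and commutative, the result is the same as that of the plain loop.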

Best answer

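SSE2 is at its best when operating on completely parallel data, for example:

for (int i = 0 ; i < N ; ++i)
    z[i] = _mm_xor_ps(x[i], y[i]);

But in your case, each iteration of the loop depends on the output of the previous iteration. This is known as a dependency chain. In short, it means that each successive XOR has to wait for the full latency of the previous one before it can proceed, which lowers throughput.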

Regarding c++ - SIMD XOR operation is not as effective as Integer XOR?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23359973/
