gpt4 book ai didi

c++ - 使用SIMD根据另一个 vector 位值计算值的乘积

转载 作者:行者123 更新时间:2023-12-01 14:58:32 25 4
gpt4 key购买 nike

我有两个 vector 。一个为a大小的N的double的 vector 和一个为b大小的无符号字符ceil(N/8)的 vector 。目的是计算a的某些值的乘积。将逐位读取b,其中每个位指示是否要在产品中考虑double中给定的a

  // Let's create some data      
unsigned nbBits = 1e7;
unsigned nbBytes = nbBits / 8;
unsigned char nbBitsInLastByte = nbBits % 8;
assert(nbBits == nbBytes * 8 + nbBitsInLastByte);
std::vector<double> a(nbBits, 0.999999); // In practice a values will vary. It is just an easy to build example I am showing here
std::vector<unsigned char> b(nbBytes, false); // I am not using `vector<bool>` nor `bitset`. I've got my reasons!
assert(a.size() == b.size() * 8);

// Set a few bits to true
for (unsigned byte = 0 ; byte < (nbBytes-1) ; byte+=2)
{
b[byte] |= 1 << 2; // set second (zero-based counting) bit to 'true'
b[byte] |= 1 << 7; // set last bit to 'true'
// ^ This is the bit index
}

如上所述,我的目标是在 a为true时计算 b中值的乘积。这可以用
  // Initialize the variable we want to compute
double product = 1.0;

// Product for the first nbByts-1 bytes
for (unsigned byte = 0 ; byte < (nbBytes-1) ; ++byte)
{
for (unsigned bit = 0 ; bit < 8 ; ++bit) // inner loop could be manually unrolled
{
if((b[byte] >> bit) & 1) // gets the bit value
product *= a[byte*8+bit];
}
}

// Product for the last byte
for (unsigned bit = 0 ; bit < nbBitsInLastByte ; ++bit)
{
if((b[nbBytes-1] >> bit) & 1) // gets the bit value
product *= a[(nbBytes-1)*8+bit];
}

这种乘积计算是我代码的慢速部分。我想知道是否显式向量化(SIMD)这个过程在这里是否有帮助?我一直在看'xmmintrin.h'中提供的功能,但是我对SIMD知之甚少,因此未能找到有帮助的东西。你能帮助我吗?

最佳答案

如果您不关心乘法顺序,那很容易。关键是来自SSE 4.1 set的_mm_blendv_pd指令。这使您可以完全无分支。

// Load 2 double values from source pointer, and conditionally multiply with the product.
// Returns the new product.
template<int startIdx>
inline __m128d product2( const double* pSource, __m128i mask, __m128d oldProduct )
{
// Multiply values unconditionally
const __m128d source = _mm_loadu_pd( pSource + startIdx );
const __m128d newProduct = _mm_mul_pd( source, oldProduct );

// We only calling product2 with 4 different template arguments.
// There are 16 vector registers in total, enough for all 4 different `maskAndBits` values.
constexpr int64_t bit1 = 1 << startIdx;
constexpr int64_t bit2 = 1 << ( startIdx + 1 );
const __m128i maskAndBits = _mm_setr_epi64x( bit1, bit2 );
mask = _mm_and_si128( mask, maskAndBits );

// NAN if the mask is 0 after the above AND i.e. the bit was not set, 0.0 if the bit was set
const __m128d maskDouble = _mm_castsi128_pd( _mm_cmpeq_epi64( mask, _mm_setzero_si128() ) );

// This instruction actually does the masking, it's from SSE 4.1
return _mm_blendv_pd( newProduct, oldProduct, maskDouble );
}

double conditionalProducts( const double* ptr, const uint8_t* masks, size_t size )
{
// Round down the size of your input vector, and multiply last couple values the old way.
assert( 0 == size % 8 );
__m128d prod = _mm_set1_pd( 1.0 );
const double* const end = ptr + size;
while( ptr < end )
{
// Broadcast the mask byte into 64-bit integer lanes
const __m128i mask = _mm_set1_epi64x( *masks );
// Compute the conditional products of 8 values
prod = product2<0>( ptr, mask, prod );
prod = product2<2>( ptr, mask, prod );
prod = product2<4>( ptr, mask, prod );
prod = product2<6>( ptr, mask, prod );
// Advance the pointers
ptr += 8;
masks++;
}
// Multiply two lanes together
prod = _mm_mul_sd( prod, _mm_shuffle_pd( prod, prod, 0b11 ) );
return _mm_cvtsd_f64( prod );
}

关于c++ - 使用SIMD根据另一个 vector 位值计算值的乘积,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58753959/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com