
sse - Is there a way to mask one end of an __m128i register based on a mask length that is not known at compile time?

Repost. Author: 行者123. Updated: 2023-12-04 15:11:06

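I have what seems like a simple problem. I load a string into an __m128i register (using _mm_loadu_si128) and then find the length of the string (using _mm_cmpistri). Now, assuming the length is less than 16, I want the register to contain only zeros after the string's first terminating zero. One way to achieve this would be to copy only "len" bytes to another register, or to AND the original register with a mask whose lowest 8 * len bits are set. But it is not easy to find a simple way to create such a mask when it depends only on the computed length.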
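The first option in the question, copying only "len" bytes, can be written in scalar form as a reference to compare SIMD versions against. This is a minimal sketch, not code from the original post: the names `loadNullTerminatedReference` and `equal128` are hypothetical, and the length is found with a plain loop rather than _mm_cmpistri.

```cpp
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstring>     // std::memcpy

// Hypothetical reference helper, not part of the original post.
// Copies the bytes before the first '\0' (at most 16) into a zero-filled
// buffer, then loads that buffer into a register.
inline __m128i loadNullTerminatedReference( const char* pointer )
{
    char buffer[ 16 ] = {};
    std::size_t len = 0;
    while( len < 16 && pointer[ len ] != '\0' )
        ++len;
    std::memcpy( buffer, pointer, len );
    return _mm_loadu_si128( ( const __m128i* )buffer );
}

// Hypothetical helper: true when all 16 bytes of a and b are equal.
inline bool equal128( __m128i a, __m128i b )
{
    return _mm_movemask_epi8( _mm_cmpeq_epi8( a, b ) ) == 0xFFFF;
}
```

Unlike a 16-byte vector load, this version never reads past the terminator, which makes it slow but convenient as a ground truth in tests.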

Best answer

Here is how I would do it. Untested.

#include <emmintrin.h> // SSE2 intrinsics

// Load 16 bytes and propagate the first zero towards the end of the register
inline __m128i loadNullTerminated( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below
    // (_mm_testz_si128 requires SSE4.1, <smmintrin.h>).
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Propagate the first 0xFF byte towards the end of the register.
    // The following 8 instructions are fast, 1 cycle of latency each.
    // Pretty sure _mm_movemask_epi8 / _BitScanForward / _mm_loadu_si128 is slightly slower, even when the mask is in L1D
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 1 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 2 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 4 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 8 ) );
    // Now apply that mask: keep only the bytes before the first '\0'
    return _mm_andnot_si128( zeroBytes, chars );
}

Update: here is another version, which uses Noah's idea of an int64 -1 instruction. It might be slightly faster.

#include <smmintrin.h> // SSE4.1: _mm_cmpeq_epi64, _mm_testz_si128

__m128i loadNullTerminated_v2( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    const __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below.
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Using the fact that v-1 == v+(-1), and -1 has all bits set
    const __m128i ones = _mm_cmpeq_epi8( zero, zero );
    __m128i mask = _mm_add_epi64( zeroBytes, ones );
    // In each 64-bit lane, this leaves 0xFF in the bytes below the first '\0' and 0 from there on.
    // A lane without any '\0' becomes all ones.
    mask = _mm_andnot_si128( zeroBytes, mask );

    // Now need to propagate across 64-bit lanes

    // ULLONG_MAX if there were no zeros in the corresponding 8-byte long piece of the string
    __m128i crossLaneMask = _mm_cmpeq_epi64( zeroBytes, zero );
    // Move the lower 64-bit lane of crossLaneMask into the higher position;
    // the lower position receives the lower lane of mask
    crossLaneMask = _mm_unpacklo_epi64( mask, crossLaneMask );
    // Update the mask.
    // The lower 8 bytes will not change because _mm_unpacklo_epi64 copied that part from the mask.
    // However, the upper lane may become zeroed out.
    // This happens when _mm_cmpeq_epi64 detected at least one '\0' in the first 8 characters.
    mask = _mm_and_si128( mask, crossLaneMask );

    // Apply that mask
    return _mm_and_si128( mask, chars );
}
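The `v - 1` step in the second version is easier to follow on a single 64-bit integer, which is what _mm_add_epi64 and _mm_andnot_si128 compute in each lane. The function below is a scalar illustration only, and its name `bytesBelowFirstMarker` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical scalar illustration, not part of the original post.
// v holds 0xFF in every byte that was '\0' and 0x00 in all other bytes.
// Returns 0xFF in every byte below the lowest marked byte and 0x00 from there on;
// when v == 0 (no '\0' in the lane) the result is all ones.
inline std::uint64_t bytesBelowFirstMarker( std::uint64_t v )
{
    // v - 1 clears the lowest set bit of v and sets every bit below it;
    // ~v then removes every bit that was set in v itself.
    return ~v & ( v - 1 );
}
```

For example, with the marker in byte 3, v = 0x00000000FF000000 gives v - 1 = 0x00000000FEFFFFFF and a result of 0x0000000000FFFFFF, so exactly the three bytes before the terminator are kept.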

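Regarding "sse - Is there a way to mask one end of an __m128i register based on a mask length that is not known at compile time?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65186226/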
