
optimization - Permuting bytes inside an SSE __m128i register

Reposted. Author: 行者123. Updated: 2023-12-03 15:42:53

I have the following problem:

An __m128i register holds 16 8-bit values, arranged in the following order (lowest byte first):

[ 1, 5, 9, 13 ] [ 2, 6, 10, 14] [3, 7, 11, 15]  [4, 8, 12, 16]

What I would like to achieve is to efficiently shuffle the bytes to get this ordering:
[ 1, 2, 3, 4 ] [ 5, 6, 7, 8] [9, 10, 11, 12]  [13, 14, 15, 16]

It is effectively a 4x4 matrix transposition, but operating on 8-bit elements
within a single register.
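
To make the target permutation concrete, here is a minimal scalar sketch of the same transposition, assuming the 16 bytes sit in memory in the order they occupy in the register (lowest byte first). The function name `transpose4x4_u8_scalar` is hypothetical and only serves as a reference to check a SIMD version against.

```c
#include <stdint.h>

/* Scalar reference (hypothetical helper, not part of the original post):
 * view the 16 bytes as a row-major 4x4 matrix and write its transpose. */
void transpose4x4_u8_scalar(const uint8_t in[16], uint8_t out[16])
{
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            /* element (row, col) of the output is element (col, row) of the input */
            out[4 * row + col] = in[4 * col + row];
        }
    }
}
```

Feeding it the layout from the question, { 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16 }, yields the bytes 1 through 16 in ascending order.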

Can you point me to the SSE instructions (preferably <= SSE2)
that are suitable for implementing this?

Best answer

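You will really want to go to SSSE3 for this; a single byte shuffle is much cleaner than trying to stay within <= SSE2.

Your code would look like this: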

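#include <emmintrin.h> // _mm_set_epi8 (SSE2)
#include <tmmintrin.h> // _mm_shuffle_epi8 (SSSE3)
...
// check if your hardware supports SSSE3
...
// _mm_set_epi8 takes its arguments from the highest byte down to the lowest,
// so output byte 0 reads source byte 0, output byte 1 reads source byte 4, etc.
__m128i mask = _mm_set_epi8(15, 11, 7, 3,
                            14, 10, 6, 2,
                            13,  9, 5, 1,
                            12,  8, 4, 0);
// input in the question's order: [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
__m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                            15, 11, 7, 3,
                            14, 10, 6, 2,
                            13,  9, 5, 1);
mtrx = _mm_shuffle_epi8(mtrx, mask); // [1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16]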
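
The snippet above leaves the SSSE3 check as a placeholder. One way to fill it in is sketched here, under the assumption of a GCC or Clang toolchain on x86: `__builtin_cpu_supports` and the `target` function attribute are compiler extensions, and the helper names `has_ssse3` and `transpose4x4_u8_ssse3` are hypothetical. Other compilers would need their own CPUID query.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_storeu_si128, _mm_set_epi8 */
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

/* Returns nonzero if the CPU reports SSSE3 (GCC/Clang builtin; assumption).
 * On other compilers this sketch conservatively reports "not supported". */
int has_ssse3(void)
{
#if defined(__GNUC__)
    return __builtin_cpu_supports("ssse3");
#else
    return 0;
#endif
}

/* Transposes 16 bytes viewed as a 4x4 matrix with one byte shuffle.
 * The target attribute lets GCC/Clang compile this without -mssse3. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
void transpose4x4_u8_ssse3(const uint8_t in[16], uint8_t out[16])
{
    /* Same control mask as in the answer: output byte i takes source byte mask[i]. */
    const __m128i mask = _mm_set_epi8(15, 11, 7, 3,
                                      14, 10, 6, 2,
                                      13,  9, 5, 1,
                                      12,  8, 4, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in); /* unaligned load */
    v = _mm_shuffle_epi8(v, mask);
    _mm_storeu_si128((__m128i *)out, v);              /* unaligned store */
}
```

A caller would test `has_ssse3()` once at startup and fall back to an SSE2 or scalar path when it returns zero.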

If you really want SSE2, this will suffice
(assuming I have interpreted your initial ordering correctly):
__m128i mask = _mm_set_epi8(0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF);
__m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                            15, 11, 7, 3,
                            14, 10, 6, 2,
                            13,  9, 5, 1); // [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
mtrx = _mm_packus_epi16(_mm_and_si128(mtrx, mask), _mm_srli_epi16(mtrx, 8)); // [1, 9, 2, 10] [3, 11, 4, 12] [5, 13, 6, 14] [7, 15, 8, 16]
mtrx = _mm_packus_epi16(_mm_and_si128(mtrx, mask), _mm_srli_epi16(mtrx, 8)); // [1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16]

Or, in a form that is easier to debug:
__m128i mtrx = _mm_set_epi8(16, 12, 8, 4,
                            15, 11, 7, 3,
                            14, 10, 6, 2,
                            13,  9, 5, 1); // [1, 5, 9, 13] [ 2, 6, 10, 14] [ 3, 7, 11, 15] [ 4, 8, 12, 16]
__m128i mask = _mm_set_epi8(0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF,
                            0x00, 0xFF, 0x00, 0xFF);
__m128i temp = _mm_srli_epi16(mtrx, 8); // [5, 0, 13, 0] [ 6, 0, 14, 0] [ 7, 0, 15, 0] [ 8, 0, 16, 0]
mtrx = _mm_and_si128(mtrx, mask);       // [1, 0, 9, 0] [ 2, 0, 10, 0] [ 3, 0, 11, 0] [ 4, 0, 12, 0]
mtrx = _mm_packus_epi16(mtrx, temp);    // [1, 9, 2, 10] [ 3, 11, 4, 12] [ 5, 13, 6, 14] [ 7, 15, 8, 16]
temp = _mm_srli_epi16(mtrx, 8);         // [9, 0, 10, 0] [11, 0, 12, 0] [13, 0, 14, 0] [15, 0, 16, 0]
mtrx = _mm_and_si128(mtrx, mask);       // [1, 0, 2, 0] [ 3, 0, 4, 0] [ 5, 0, 6, 0] [ 7, 0, 8, 0]
mtrx = _mm_packus_epi16(mtrx, temp);    // [1, 2, 3, 4] [ 5, 6, 7, 8] [ 9, 10, 11, 12] [13, 14, 15, 16]
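
The two SSE2 passes are the same step applied twice: each pass gathers the even-indexed bytes into the low half of the register and the odd-indexed bytes into the high half, and doing that twice on 16 bytes amounts to the 4x4 transposition. Below is a sketch that wraps the idea in a function operating on memory. The name `transpose4x4_u8_sse2` is hypothetical, and `_mm_set1_epi16(0x00FF)` is swapped in as an equivalent way to build the same low-byte mask.

```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Transposes 16 bytes viewed as a 4x4 matrix using only SSE2
 * (hypothetical helper built from the answer's and/shift/pack sequence). */
void transpose4x4_u8_sse2(const uint8_t in[16], uint8_t out[16])
{
    /* 0x00FF in every 16-bit lane keeps the low byte of each lane. */
    const __m128i low_bytes = _mm_set1_epi16(0x00FF);
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    for (int pass = 0; pass < 2; ++pass) {
        __m128i even = _mm_and_si128(v, low_bytes); /* bytes 0, 2, 4, ... as 16-bit lanes */
        __m128i odd  = _mm_srli_epi16(v, 8);        /* bytes 1, 3, 5, ... as 16-bit lanes */
        /* Every lane is in 0..255, so the unsigned saturation never clips. */
        v = _mm_packus_epi16(even, odd);
    }

    _mm_storeu_si128((__m128i *)out, v);
}
```

Because both inputs to `_mm_packus_epi16` are already limited to the range 0..255, the routine should behave the same for arbitrary byte values, not only for the small numbers used in the example.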

Regarding "optimization - Permuting bytes inside an SSE __m128i register", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24595003/
