
c - Efficient transposition of a 2D nibble matrix?

Reposted. Author: 行者123. Last updated: 2023-12-05 04:24:34

Given a 2D 4x8 matrix of nibbles, represented as a 16-byte uint8_t array. Each consecutive pair of nibbles i, j is stored in one byte, computed as (j << 4) | i.

For example, given the following matrix:

     0  1  2  3  3  7  1  9
     4  5  6  7  4  1  6 15
     8  9 10 11  3 14  6 11
    12 13 14 15  8 10  7  4

it is represented as:

    const uint8_t matrix[] = {
        0x10, 0x32, 0x73, 0x91,
        0x54, 0x76, 0x14, 0xf6,
        0x98, 0xba, 0xe3, 0xb6,
        0xdc, 0xfe, 0xa8, 0x47,
    };

The desired resulting array would be:

    const uint8_t result[] = {
        0x40, 0xc8, 0x51, 0xd9,
        0x62, 0xea, 0x73, 0xfb,
        0x43, 0x83, 0x17, 0xae,
        0x61, 0x76, 0xf9, 0x4b,
    };
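To make the layout concrete, here is a minimal scalar reference, written only as a sketch for checking faster versions against. The names get_nibble, put_nibble and transpose_4x8_nibbles_ref are hypothetical helpers, not part of the original question, and src and dst are assumed not to overlap.

```c
#include <stdint.h>

/* Hypothetical helper: read the nibble at (row, col) of a matrix with
   `cols` nibbles per row, packed two per byte, low nibble first. */
static uint8_t get_nibble(const uint8_t *m, int cols, int row, int col) {
    int idx = row * cols + col;                 /* linear nibble index */
    uint8_t byte = m[idx / 2];
    return (idx & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0f);
}

/* Hypothetical helper: OR a nibble (0..15) into a zero-initialised matrix. */
static void put_nibble(uint8_t *m, int cols, int row, int col, uint8_t v) {
    int idx = row * cols + col;
    m[idx / 2] |= (idx & 1) ? (uint8_t)(v << 4) : v;
}

/* Scalar reference: transpose a 4x8 nibble matrix (16 bytes) into an
   8x4 nibble matrix (16 bytes). src and dst must not overlap. */
void transpose_4x8_nibbles_ref(const uint8_t *src, uint8_t *dst) {
    for (int i = 0; i < 16; i++)
        dst[i] = 0;
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 8; c++)
            put_nibble(dst, 4, c, r, get_nibble(src, 8, r, c));
}
```

Applied to the matrix array above, this should reproduce the result array byte for byte.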

What is the most efficient way to implement this function? Extensions up to AVX2 are fair game.

Here is my C implementation so far, based on "Nibble shuffling with x64 SIMD". It splits the matrix into two 64-bit inputs, unpacks the nibbles, shuffles them, and repacks them.

    #include <immintrin.h>
    #include <stdint.h>

    __m128i unpack_nibbles(__m128i src) {
        __m128i nibbles_hi = _mm_srli_epi64(src, 4);

        // Interleave the original bytes with the shifted bytes [hi lo, 0000 hi, ...], then clear each high nibble
        __m128i unpacked = _mm_unpacklo_epi8(src, nibbles_hi);
        return _mm_and_si128(unpacked, _mm_set1_epi8(0xf));
    }

    void transpose_4x8_nibbles(uint8_t *src, uint8_t *dst) {
        uint8_t *src_lo = src + 0x8;

        __m128i data_hi = _mm_loadl_epi64((__m128i*)src);
        __m128i data_lo = _mm_loadl_epi64((__m128i*)src_lo);

        data_hi = unpack_nibbles(data_hi);
        data_lo = unpack_nibbles(data_lo);

        // Transpose
        __m128i transpose_mask = _mm_setr_epi8(0, 0x8, 0x1, 0x9, 0x2, 0xa, 0x3, 0xb, 0x4, 0xc, 0x5, 0xd, 0x6, 0xe, 0x7, 0xf);
        data_hi = _mm_shuffle_epi8(data_hi, transpose_mask);
        data_lo = _mm_shuffle_epi8(data_lo, transpose_mask);

        // Pack nibbles
        __m128i pack_mask = _mm_set1_epi16(0x1001);
        data_hi = _mm_maddubs_epi16(data_hi, pack_mask); // even-indexed bytes are multiplied by 0x01, odd-indexed bytes by 0x10
        data_lo = _mm_maddubs_epi16(data_lo, pack_mask);

        __m128i data = _mm_packus_epi16(data_hi, data_lo);
        data = _mm_shuffle_epi8(data, transpose_mask);

        _mm_store_si128((__m128i*) dst, data); // aligned store: dst must be 16-byte aligned
    }
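The packing step may be easier to follow in scalar form. The following is only a sketch of what one 16-bit lane of _mm_maddubs_epi16 computes with the 0x1001 multiplier; maddubs_lane and pack_nibble_pair are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of _mm_maddubs_epi16(a, b): the bytes of
   a are unsigned, the bytes of b are signed, and the two products are added
   with signed 16-bit saturation. */
static int16_t maddubs_lane(uint8_t a_even, uint8_t a_odd,
                            int8_t b_even, int8_t b_odd) {
    int32_t sum = (int32_t)a_even * b_even + (int32_t)a_odd * b_odd;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Pack two unpacked nibbles (each 0..15) into one byte. In the 16-bit
   constant 0x1001, byte 0 is 0x01 and byte 1 is 0x10, so the even-indexed
   nibble stays low and the odd-indexed nibble is shifted into the high half. */
uint8_t pack_nibble_pair(uint8_t lo, uint8_t hi) {
    return (uint8_t)maddubs_lane(lo, hi, 0x01, 0x10);
}
```

Because both inputs are at most 15, the sum is at most 255, so saturation never triggers here and the low byte of each lane holds the packed pair.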

Best answer

Let's name the nibbles as follows (everything is in little-endian order):

X0 Y0 X1 Y1 X2 Y2 X3 Y3
Z0 W0 Z1 W1 Z2 W2 Z3 W3
X4 Y4 X5 Y5 X6 Y6 X7 Y7
Z4 W4 Z5 W5 Z6 W6 Z7 W7

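After transposition, we notice that the X nibbles stay in the low nibble, the W nibbles stay in the high nibble, the Y nibbles move from the high nibble to the low nibble, and the Z nibbles move from the low nibble to the high nibble: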

X0 Z0 X4 Z4
Y0 W0 Y4 W4
X1 Z1 X5 Z5
Y1 W1 Y5 W5
X2 Z2 X6 Z6
Y2 W2 Y6 W6
X3 Z3 X7 Z7
Y3 W3 Y7 W7

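This means the X and W nibbles can be put in the right place by a simple pshufb. The Z nibbles all need to be moved up (or multiplied by 0x10), and the Y nibbles need to be moved down (or multiplied as a uint16 by 0x1000, taking the upper half of the 32-bit result).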

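A block 00 Z0 00 Z4 Y0 00 Y4 00 is really just a 32-bit integer, and we can get it almost directly from the two 16-bit halves Z0 00 Z4 00 and 00 Y0 00 Y4 with a single pmaddwd instruction, using the multipliers 0x10 and 0x1000:

00 Z0 00 Z4 Y0 00 Y4 00 = (00 Y0 00 Y4) * 0x1000 + (Z0 00 Z4 00) * 0x10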

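Conveniently, these nibbles sit in the same bytes as X0, X4 and W0, W4, so a single pshufb is enough to arrange the corresponding bytes. Unfortunately, pmaddwd treats its 16-bit inputs as signed, so if Y4 > 7 we get a negative integer, which requires masking away some bits again (at least we can re-use the same mask).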
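The arithmetic for a single 32-bit lane can be checked in plain C. This is a scalar sketch only: sext16, madd_lane and transpose_lane are hypothetical names, and madd_lane models one lane of pmaddwd rather than calling the intrinsic. The lane is assumed to hold the bytes [X0 Y0][X4 Y4][Z0 W0][Z4 W4], low nibble first within each byte.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v, as pmaddwd does with each input half. */
static int32_t sext16(uint32_t v) {
    v &= 0xffffu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Scalar model of one 32-bit lane of pmaddwd (_mm_madd_epi16):
   lo(a) * lo(b) + hi(a) * hi(b), with signed 16-bit inputs. */
static uint32_t madd_lane(uint32_t a, uint32_t b) {
    int64_t sum = (int64_t)sext16(a) * sext16(b)
                + (int64_t)sext16(a >> 16) * sext16(b >> 16);
    return (uint32_t)sum;                 /* wraps modulo 2^32 */
}

/* Transpose step for one lane: keep X and W, move Y down and Z up. */
uint32_t transpose_lane(uint32_t lane) {
    const uint32_t mask = 0x0f0ff0f0u;    /* selects the Y and Z nibbles */
    uint32_t xw = lane & ~mask;           /* X and W are already in place */
    uint32_t yz = lane & mask;
    uint32_t yz_trans = madd_lane(yz, 0x00101000u);
    /* The second mask removes the bits set when the Y half was negative. */
    return xw | (yz_trans & mask);
}
```

With the first lane of the question's example, 0xdc549810, the Y half is 0x9010 (negative as a signed 16-bit value), the unmasked product is 0xf901c040, and the final lane is 0xd951c840, which matches the first four bytes of the desired result.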

Overall, this function should do the job:

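    #include <immintrin.h>
    #include <stdint.h>

    void transpose_4x8_nibbles(uint8_t const *src, uint8_t *dst) {
        __m128i const input = _mm_loadu_si128((__m128i const*)src);

        // Gather bytes so that each 32-bit lane holds [X Y][X Y][Z W][Z W]
        __m128i const shuff = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 8, 4, 12, 1, 9, 5, 13, 2, 10, 6, 14, 3, 11, 7, 15));
        __m128i const mask = _mm_set1_epi32(0x0f0ff0f0);   // selects the Y and Z nibbles
        __m128i const XW = _mm_andnot_si128(mask, shuff);  // X and W nibbles are already in place
        __m128i const YZ = _mm_and_si128(mask, shuff);
        // Per lane: (Y half) * 0x1000 + (Z half) * 0x10
        __m128i const YZ_trans = _mm_madd_epi16(YZ, _mm_set1_epi32(0x00101000));
        // Re-masking clears the bits set when the Y half was negative
        __m128i const result = _mm_or_si128(XW, _mm_and_si128(mask, YZ_trans));

        _mm_storeu_si128((__m128i*)dst, result);
    }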
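For targets without SSSE3, the same idea can be written portably. The sketch below is not the answer's code: it assumes the same byte permutation and mask, but replaces the pmaddwd step with two plain 12-bit shifts per 32-bit lane, which also sidesteps the sign issue. The name transpose_4x8_nibbles_scalar is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Portable sketch: same byte permutation as the pshufb mask above, with the
   multiply-add replaced by shifts. Works in place (src == dst) because the
   output is staged in a local buffer. */
void transpose_4x8_nibbles_scalar(const uint8_t *src, uint8_t *dst) {
    static const uint8_t perm[16] = {0, 8, 4, 12, 1, 9, 5, 13,
                                     2, 10, 6, 14, 3, 11, 7, 15};
    const uint32_t mask = 0x0f0ff0f0u;    /* selects the Y and Z nibbles */
    uint8_t out[16];

    for (int lane = 0; lane < 4; lane++) {
        /* Gather four permuted bytes into a little-endian 32-bit value. */
        uint32_t v = 0;
        for (int b = 0; b < 4; b++)
            v |= (uint32_t)src[perm[4 * lane + b]] << (8 * b);

        uint32_t xw = v & ~mask;          /* X and W are already in place */
        uint32_t yz = v & mask;
        /* Y: high nibbles of bytes 0,1 move to low nibbles of bytes 2,3.
           Z: low nibbles of bytes 2,3 move to high nibbles of bytes 0,1. */
        uint32_t yz_trans = ((yz & 0x0000f0f0u) << 12)
                          | ((yz >> 12) & 0x0000f0f0u);
        v = xw | yz_trans;

        for (int b = 0; b < 4; b++)
            out[4 * lane + b] = (uint8_t)(v >> (8 * b));
    }
    memcpy(dst, out, sizeof out);
}
```

A version like this can also serve as a cross-check for the SIMD function on the question's example input.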

Godbolt demo (only SSSE3 is required, because of pshufb): https://godbolt.org/z/c43oTz43r

Regarding "c - Efficient transposition of a 2D nibble matrix?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/73450997/
