c - NEON equivalent of SSE intrinsics

Reposted. Author: 太空狗. Updated: 2023-10-29 17:03:17

I am trying to convert C code into optimized code using NEON intrinsics.

Here is the C code, which operates on two scalar operands rather than on vectors of operands.

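uint16_t mult_z216(uint16_t a, uint16_t b) {
    /* Cast first: a * b would otherwise be evaluated in signed int and can overflow. */
    unsigned int c1 = (unsigned int)a * b;
    if (c1) {
        int c1h = c1 >> 16;      /* high 16 bits of the product */
        int c1l = c1 & 0xffff;   /* low 16 bits of the product */
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}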

An SSE-optimized version of this operation has already been implemented:

/* t0 and C_0x0_XMM (presumably an all-zero __m128i) must be in scope at the call site. */
#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));          /* bitwise OR of a and b */ \
    (c) = _mm_mullo_epi16((a), (b));       /* low 16 bits of each 16-bit product */ \
    (a) = _mm_mulhi_epu16((a), (b));       /* high 16 bits of each unsigned 16-bit product */ \
    (b) = _mm_subs_epu16((c), (a));        /* saturating unsigned low - high: 0 where low <= high */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM); /* 0xFFFF where low <= high, else 0x0 */ \
    (b) = _mm_srli_epi16((b), 15);         /* logical shift right by 15 bits: 0xFFFF becomes 1 */ \
    (c) = _mm_sub_epi16((c), (a));         /* wrapping low - high */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM); /* 0xFFFF where low == high, else 0x0 */ \
    (c) = _mm_add_epi16((c), (b));         /* add the 0/1 correction term */ \
    t0  = _mm_and_si128(t0, (a));          /* keep a | b only where low == high */ \
    (c) = _mm_sub_epi16((c), t0);          /* subtract the masked a | b */

I have almost finished converting this using NEON intrinsics:

#define MULT_Z216_NEON(a, b, out) \
temp = vorrq_u16 (*a, *b); \
// ??
// ??
*b = vsubq_u16(*out, *a); \
*b = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
*b = vshrq_n_u16(*b, 15); \
*out = vsubq_s16(*out, *a); \
*a = vceqq_s16(*c, vdupq_n_u16(0x0000)); \
*c = vaddq_s16(*c, *b); \
*temp = vandq_u16(*temp, *a); \
*out = vsubq_s16(*out, *a);

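I am only missing the NEON equivalents of _mm_mullo_epi16((a), (b)) and _mm_mulhi_epu16((a), (b)). Either I have misunderstood something, or NEON has no such intrinsics. If there is no equivalent, how can these steps be achieved with NEON intrinsics?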

Update:

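I forgot to stress the following: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer suggested the intrinsic vqdmulhq_s16(). That does not match the given implementation, because that multiply intrinsic interprets the vector elements as signed values and produces wrong output.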
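The mismatch can be illustrated with a small scalar model. This is only a sketch: mulhi_u16_model and qdmulh_s16_model are hypothetical helper names, and the second one models a single lane of vqdmulhq_s16 under the assumption that it computes the saturated doubled signed product and returns its high 16 bits.

```c
#include <stdint.h>

/* Hypothetical model: high 16 bits of an unsigned 16x16 product,
   i.e. what one lane of _mm_mulhi_epu16 produces. */
uint16_t mulhi_u16_model(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Hypothetical model of one vqdmulhq_s16 lane (assumed semantics):
   saturate(2 * a * b) to 32 bits, then take the high 16 bits. */
int16_t qdmulh_s16_model(int16_t a, int16_t b) {
    int64_t p = 2 * (int64_t)a * b;
    if (p > INT32_MAX) {
        p = INT32_MAX;              /* only reached for a == b == -32768 */
    }
    int64_t q = p / 65536;          /* C division truncates toward zero */
    if (p % 65536 < 0) {
        q -= 1;                     /* adjust to floor, like an arithmetic shift */
    }
    return (int16_t)q;
}
```

For the bit pattern 40000 (which is -25536 when read as int16_t) multiplied by 2, the unsigned model gives 1, while the signed doubling model gives -2 (0xFFFE), so the two operations are not interchangeable for operands above 32767.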

Best answer

You can use:

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t) 

It returns a vector of 32-bit products. If you want to split the result into high and low halves, you can use the NEON unzip intrinsics.
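As a rough sketch of how this could fill the two missing steps of the question's macro, the function below mirrors the SSE sequence one operation at a time. It makes a few substitutions that are not from the answer: the low halves come from vmulq_u16 (which keeps the low 16 bits of each product, like _mm_mullo_epi16), and the high halves come from the vmull_u16 products via the narrowing shift vshrn_n_u32 instead of the unzip intrinsics, which avoids depending on lane layout after a reinterpret. The names mult_z216_ref, mult_z216_neon and mult_z216_x8 are hypothetical, and the scalar fallback exists only so the file also builds on targets without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference, equivalent to the question's mult_z216. */
uint16_t mult_z216_ref(uint16_t a, uint16_t b) {
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        uint32_t hi = c1 >> 16;
        uint32_t lo = c1 & 0xffff;
        return (uint16_t)(lo - hi + (lo < hi ? 1 : 0));
    }
    return (uint16_t)(1 - a - b);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* Hypothetical NEON port of MULT_Z216_SSE for eight uint16_t lanes. */
static inline uint16x8_t mult_z216_neon(uint16x8_t a, uint16x8_t b) {
    const uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(a, b);                 /* _mm_or_si128 */

    /* Low 16 bits of each product (_mm_mullo_epi16). */
    uint16x8_t lo = vmulq_u16(a, b);

    /* Full 32-bit products, four lanes at a time, then their high halves
       (_mm_mulhi_epu16). */
    uint32x4_t p0 = vmull_u16(vget_low_u16(a), vget_low_u16(b));
    uint32x4_t p1 = vmull_u16(vget_high_u16(a), vget_high_u16(b));
    uint16x8_t hi = vcombine_u16(vshrn_n_u32(p0, 16), vshrn_n_u32(p1, 16));

    /* 1 where lo <= hi, else 0 (_mm_subs_epu16, _mm_cmpeq_epi16, _mm_srli_epi16). */
    uint16x8_t corr = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);

    uint16x8_t diff = vsubq_u16(lo, hi);             /* wrapping lo - hi */
    uint16x8_t eq   = vceqq_u16(diff, zero);         /* 0xFFFF where lo == hi */
    uint16x8_t r    = vaddq_u16(diff, corr);
    return vsubq_u16(r, vandq_u16(t0, eq));          /* subtract masked a | b */
}
#endif

/* Multiplies eight lanes; uses NEON when available, otherwise the scalar reference. */
void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u16(out, mult_z216_neon(vld1q_u16(a), vld1q_u16(b)));
#else
    for (int i = 0; i < 8; ++i) {
        out[i] = mult_z216_ref(a[i], b[i]);
    }
#endif
}
```

Comparing mult_z216_x8 against mult_z216_ref lane by lane, including operands of 0 and 65535, is a reasonable way to check the port on the actual target. Note that the saturating subtract is vqsubq_u16; the plain vsubq_u16 wraps instead of saturating.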

Regarding "c - NEON equivalent of SSE intrinsics", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11292884/
