android - Optimizing NEON assembly functions

Reposted. Author: 太空宇宙. Updated: 2023-11-03 10:41:08

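I am developing a native Android application that should run on devices with ARMv7 processors. For certain reasons, I need to do some heavy computation on vectors (of short and/or float). I implemented some assembly functions using NEON instructions to speed up the computation. I got a speedup factor of 1.5, which is not bad. I would like to know whether I can improve these functions to make them run faster.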

So the question is: what changes can I make to improve these functions?

//add two float vectors.
//the result could be put in src1 instead of dst
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{

asm volatile (
"1: \n"
"vld1.32 {q0}, [%[src1]]! \n"
"vld1.32 {q1}, [%[src2]]! \n"
"vadd.f32 q0, q0, q1 \n"
"subs %[count], %[count], #4 \n"
"vst1.32 {q0}, [%[dst]]! \n"
"bgt 1b \n"
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "q0", "q1"
);
}

//multiply a float vector by a scalar.
//the result could be put in src1 instead of dst
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{

asm volatile (

"vdup.32 q1, %[scalar] \n"
"2: \n"
"vld1.32 {q0}, [%[src1]]! \n"
"vmul.f32 q0, q0, q1 \n"
"subs %[count], %[count], #4 \n"
"vst1.32 {q0}, [%[dst]]! \n"
"bgt 2b \n"
: [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
: [scalar] "r" (scalar)
: "memory", "cc", "q0", "q1"
);
}

//add two short vectors -> no problem with value-range limits
//the result should be put in a dst different from src1 and src2
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{

asm volatile (
"3: \n"
"vld1.16 {q0}, [%[src1]]! \n"
"vld1.16 {q1}, [%[src2]]! \n"
"vadd.i16 q0, q0, q1 \n"
"subs %[count], %[count], #8 \n"
"vst1.16 {q0}, [%[dst]]! \n"
"bgt 3b \n"
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "q0", "q1"
);
}

//multiply a short vector by a float vector and put the result back into a short vector
//the result should be put in a dst different from src1
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
asm volatile (
"4: \n"
"vld1.16 {d0}, [%[src1]]! \n"
"vld1.32 {q1}, [%[src2]]! \n"
"vmovl.s16 q0, d0 \n"
"vcvt.f32.s32 q0, q0 \n"
"vmul.f32 q0, q0, q1 \n"
"vcvt.s32.f32 q0, q0 \n"
"vmovn.s32 d0, q0 \n"
"subs %[count], %[count], #4 \n"
"vst1.16 {d0}, [%[dst]]! \n"
"bgt 4b \n"
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "q0", "q1"

);
}

Thanks in advance!

Best answer

You can try unrolling the loop so that each iteration processes more elements.

Your add_float_vector_with_neon3 code takes 10 cycles per 4 elements (because of stalls), whereas a version unrolled to 16 elements takes 21 cycles. http://pulsar.webshaker.net/ccc/sample-34e5f701

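There is some overhead because you need to handle the remainder (or you can pad your data to a multiple of 16), but if you have a lot of data, that overhead should be fairly low compared with the actual summation.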
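
The unrolling and remainder handling described in this answer could look roughly like the sketch below. It is an assumption-laden illustration, not the answerer's code: it uses NEON intrinsics from arm_neon.h instead of the inline assembly in the question, the function name add_float_vector_unrolled is hypothetical, and no cycle counts were measured for it. Taking the figures quoted above at face value, 10 cycles per 4 elements is 2.5 cycles per element, while 21 cycles per 16 elements is about 1.3 cycles per element. On a build without NEON the guarded block is skipped and the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name not from the original post):
   dst[i] = src1[i] + src2[i] for i in [0, count). */
void add_float_vector_unrolled(float *dst, const float *src1,
                               const float *src2, int count)
{
    assert(count >= 0);
    int i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Main loop: 16 floats per iteration. All loads are issued before the
       adds so that independent instructions can overlap load latency. */
    for (; i + 16 <= count; i += 16) {
        float32x4_t a0 = vld1q_f32(src1 + i);
        float32x4_t a1 = vld1q_f32(src1 + i + 4);
        float32x4_t a2 = vld1q_f32(src1 + i + 8);
        float32x4_t a3 = vld1q_f32(src1 + i + 12);
        float32x4_t b0 = vld1q_f32(src2 + i);
        float32x4_t b1 = vld1q_f32(src2 + i + 4);
        float32x4_t b2 = vld1q_f32(src2 + i + 8);
        float32x4_t b3 = vld1q_f32(src2 + i + 12);
        vst1q_f32(dst + i,      vaddq_f32(a0, b0));
        vst1q_f32(dst + i + 4,  vaddq_f32(a1, b1));
        vst1q_f32(dst + i + 8,  vaddq_f32(a2, b2));
        vst1q_f32(dst + i + 12, vaddq_f32(a3, b3));
    }
#endif

    /* Remainder: the 0..15 leftover elements on NEON builds,
       or the whole array when NEON is not available. */
    for (; i < count; ++i)
        dst[i] = src1[i] + src2[i];
}
```

With count = 19, for example, a NEON build runs one 16-element iteration followed by three scalar tail iterations. If the buffers are padded to a multiple of 16, as the answer suggests, the tail loop never executes.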

Regarding android - Optimizing NEON assembly functions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33435328/
