iphone - C vs assembler vs NEON performance

Repost. Author: 太空狗. Updated: 2023-10-29 16:29:55

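I am working on an iPhone application that does real-time image processing. One of the earliest steps in its pipeline is converting a BGRA image to greyscale. I tried several different methods, and the difference in timing results is far greater than I had imagined. First, I tried C. I approximate the conversion to luminosity by computing (B + 2*G + R) / 4: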

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
    uchar *pIn = (uchar*) imBGRA.data;
    uchar *pLimit = pIn + imBGRA.MemSize();

    uchar *pOut = imByte.data;
    for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
    {
        unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
        pOut[0] = sumA / 4;
        unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
        pOut[1] = sumB / 4;
        unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
        pOut[2] = sumC / 4;
        unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
        pOut[3] = sumD / 4;
        pOut +=4;
    }
}

This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing:

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();

unsigned int *pOut = (unsigned int*) imByte.data;

for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
register unsigned int nBGRA1 asm("r4");
register unsigned int nBGRA2 asm("r5");
unsigned int nZero=0;
unsigned int nSum1;
unsigned int nSum2;
unsigned int nPacked1;
asm volatile(

"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0] \n" // Load in two BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nPacked1], %[nSum1], LSR #2 \n" // Init packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n" // Add to packed word

"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8] \n" // Load in two more BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nSum1], %[nSum1], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n" // Add to packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n" // Add to packed word

///////////
////////////

: [pIn]"+r" (pIn),
[nBGRA1]"+r"(nBGRA1),
[nBGRA2]"+r"(nBGRA2),
[nZero]"+r"(nZero),
[nSum1]"+r"(nSum1),
[nSum2]"+r"(nSum2),
[nPacked1]"+r"(nPacked1)
:
: "cc" );
*pOut = nPacked1;
pOut++;
}
}

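This function converts the same image in 12 ms, almost 5 times faster! I have not programmed in assembler before, but I assumed it would not be that much faster than C for such a simple operation. Inspired by this success, I kept searching and found a NEON conversion example here.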

void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
    // Assumes tot_pixels is a non-zero multiple of 8.
    __asm__ volatile("lsr %2, %2, #3 \n" // 8 pixels per iteration
        "# build the three constants: \n"
        "mov r4, #28 \n" // Blue channel multiplier
        "mov r5, #151 \n" // Green channel multiplier
        "mov r6, #77 \n" // Red channel multiplier
        "vdup.8 d4, r4 \n"
        "vdup.8 d5, r5 \n"
        "vdup.8 d6, r6 \n"
        "0: \n"
        "# load 8 pixels: \n"
        "vld4.8 {d0-d3}, [%1]! \n"
        "# do the weight average: \n"
        "vmull.u8 q7, d0, d4 \n"
        "vmlal.u8 q7, d1, d5 \n"
        "vmlal.u8 q7, d2, d6 \n"
        "# shift and store: \n"
        "vshrn.u16 d7, q7, #8 \n" // Divide q7 by 256 and store in d7
        "vst1.8 {d7}, [%0]! \n"
        "subs %2, %2, #1 \n" // Decrement iteration count
        "bne 0b \n" // Repeat until iteration count is zero
        : "+r"(output_data), // the asm modifies all three operands,
          "+r"(input_data),  // so they are declared read-write
          "+r"(tot_pixels)
        :
        : "r4", "r5", "r6",
          "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "q7",
          "cc", "memory"
    );
}

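The timing results were unbelievable. It converts the same image in 1 ms: 12 times faster than the assembler version and 55 times faster than C. I had no idea such performance gains were possible. In light of this, I have a few questions. First, am I doing something wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are accurate, in what kinds of situations can I expect to see gains like these? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55 times faster. Should I learn assembler/NEON and use them in any loop that takes an appreciable amount of time?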
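One thing worth keeping in mind when comparing the three timings: the NEON routine does not compute (B + 2*G + R) / 4. It uses the weights 28, 151 and 77 (which sum to 256) and shifts right by 8. A minimal scalar sketch of that arithmetic, useful as a reference when checking the NEON output, might look like the following. The function name greyScaleScalar is hypothetical and does not appear in the original post.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for the arithmetic in the NEON loop:
//   gray = (28*B + 151*G + 77*R) >> 8
// The weights sum to 256, so the intermediate never exceeds 255*256 = 65280
// and fits in 16 bits, which is why the NEON code can accumulate in u16 lanes.
void greyScaleScalar(std::uint8_t* out, const std::uint8_t* bgra, std::size_t pixels)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        const unsigned b = bgra[4 * i + 0];
        const unsigned g = bgra[4 * i + 1];
        const unsigned r = bgra[4 * i + 2];
        out[i] = static_cast<std::uint8_t>((28u * b + 151u * g + 77u * r) >> 8);
    }
}
```

Because the weights differ, this version and the (B + 2*G + R) / 4 versions produce slightly different grey values for the same input.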

Update 1: I have posted the assembler output of my C function in a text file at http://temp-share.com/show/f3Yg87jQn . It was far too large to include directly here.

Timing is done using OpenCV functions.

double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount())-duration;
duration /= cv::getTickFrequency();
//duration is now the elapsed time in seconds; multiply by 1000.0 for ms
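A single measurement of a function that runs for a few milliseconds can be noisy, so it may help to average over many runs. The following is a minimal sketch using only the standard library rather than OpenCV. The helper name meanMilliseconds is hypothetical and is not part of the original post.

```cpp
#include <chrono>

// Hypothetical helper (not from the original post): calls `fn` `runs` times
// and returns the mean wall-clock time per call in milliseconds.
template <typename Fn>
double meanMilliseconds(Fn&& fn, int runs)
{
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

A call such as meanMilliseconds([&] { BGRA_To_Byte(imBGRA, imByte); }, 100) would then report the average cost of one conversion, assuming the images are set up as in the question.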

Results

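I tested several of the suggested improvements. First, as recommended by Viktor, I reordered the inner loop to put all the fetches first. The inner loop then looked like this: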

for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
    //Jul 16, 2012 MR: Read and writes collected
    sumA = pIn[0] + 2 * pIn[1] + pIn[2];
    sumB = pIn[4] + 2 * pIn[5] + pIn[6];
    sumC = pIn[8] + 2 * pIn[9] + pIn[10];
    sumD = pIn[12] + 2 * pIn[13] + pIn[14];
    pOut[0] = sumA / 4;
    pOut[1] = sumB / 4;
    pOut[2] = sumC / 4;
    pOut[3] = sumD / 4;
    pOut +=4; // advance only after the four stores, otherwise the first four output bytes are skipped
}

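This change reduced the processing time by 2 ms, to 53 ms. Next, also as recommended by Viktor, I changed the function to fetch each pixel as a uint. The inner loop then looked like this: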

unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;

for(; in_int < end; in_int+=4) // Does four pixels at a time
{
    unsigned int pixelA = in_int[0];
    unsigned int pixelB = in_int[1];
    unsigned int pixelC = in_int[2];
    unsigned int pixelD = in_int[3];

    uchar* byteA = (uchar*)&pixelA;
    uchar* byteB = (uchar*)&pixelB;
    uchar* byteC = (uchar*)&pixelC;
    uchar* byteD = (uchar*)&pixelD;

    unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
    unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
    unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
    unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];

    out[0] = sumA / 4;
    out[1] = sumB / 4;
    out[2] = sumC / 4;
    out[3] = sumD / 4;
    out +=4;
}

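This modification had a dramatic effect, dropping the processing time to 14 ms, a reduction of 39 ms (about 74%). This last result is very close to the assembler's 12 ms. The final optimization, suggested by rob, was to include the __restrict keyword. I added it to every pointer declaration, changing the following lines: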

unsigned int* __restrict in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* __restrict out = temp.data;
...
uchar* __restrict byteA = (uchar*)&pixelA;
uchar* __restrict byteB = (uchar*)&pixelB;
uchar* __restrict byteC = (uchar*)&pixelC;
uchar* __restrict byteD = (uchar*)&pixelD;
...

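These changes had no measurable effect on the processing time. Thanks to everyone for the help; I will pay much closer attention to memory management in the future.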
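For reference, __restrict is a compiler extension in C++ (supported by GCC, Clang and MSVC) and qualifies the pointer itself, so it is written after the '*'. It is most often applied to function parameters, where it promises the compiler that the input and output buffers do not overlap. Since byte pointers are otherwise allowed to alias anything, this promise can in principle let the compiler keep loaded values in registers across stores. The sketch below only illustrates the syntax. The function name bgraToGrayRestrict is hypothetical, and whether it changes the generated code depends on the compiler and flags.

```cpp
#include <cstddef>
#include <cstdint>

// __restrict goes after the '*': it qualifies the pointer and promises
// that `in` and `out` never refer to overlapping memory.
void bgraToGrayRestrict(const std::uint8_t* __restrict in,
                        std::uint8_t* __restrict out,
                        std::size_t pixels)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        // Same (B + 2*G + R) / 4 approximation as in the question.
        out[i] = static_cast<std::uint8_t>(
            (in[4 * i] + 2 * in[4 * i + 1] + in[4 * i + 2]) / 4);
    }
}
```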

Best answer

There is an explanation of some of the reasons for NEON's "success" here: http://hilbert-space.de/?p=22

Try compiling your C code with the "-S -O3" switches to see the optimized output of the GCC compiler.

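IMHO, the key to the success is the optimized read/write pattern used by both assembly versions. NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use "unsigned ints").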

Look at these lines in the loop:

unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;

The reads and writes are heavily interleaved. A slightly better version of the loop body would be

// and the pIn reads can be combined into a single 4-byte fetch
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut +=4; // advance after the stores so no output bytes are skipped

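Keep in mind that each "unsigned int sumA" declaration here may effectively mean stack allocation (think of an alloca() call) in an unoptimized build, so you can be wasting a lot of cycles on allocating the four temporaries in every iteration.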

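Also, the pIn[i] indexing fetches only a single byte from memory at a time. A better approach is to read a whole int and then extract the individual bytes from it. To make things faster, use an "unsigned int*" to read the 4 bytes (pIn[i * 4 + 0], pIn[i * 4 + 1], pIn[i * 4 + 2], pIn[i * 4 + 3]) in one go.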
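The word-sized fetch described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes a little-endian target (as on the iPhone's ARM cores), so that the lowest byte of each loaded word is B, and it uses memcpy for the load so that alignment and aliasing are not a concern. The function name bgraToGrayWordLoads is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// One 32-bit load per BGRA pixel, then shifts and masks to pick out channels.
// Assumes little-endian byte order: bits 0-7 = B, 8-15 = G, 16-23 = R, 24-31 = A.
void bgraToGrayWordLoads(const std::uint8_t* bgra, std::uint8_t* gray, std::size_t pixels)
{
    for (std::size_t i = 0; i < pixels; ++i) {
        std::uint32_t px;
        std::memcpy(&px, bgra + 4 * i, sizeof px);  // optimizing compilers typically emit one load
        const std::uint32_t b = px & 0xFFu;
        const std::uint32_t g = (px >> 8) & 0xFFu;
        const std::uint32_t r = (px >> 16) & 0xFFu;
        gray[i] = static_cast<std::uint8_t>((b + 2u * g + r) >> 2);  // (B + 2G + R) / 4
    }
}
```

Extracting the channels with shifts keeps the pixel in a register, whereas taking the address of a local word and indexing it as bytes can force it back through memory.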

The NEON version is clearly superior: the lines

"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"

"# save everything in one shot \n"
"vst1.8 {d7}, [%0]! \n"

save most of the time that would otherwise be spent on memory access.
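The same load, multiply-accumulate, narrow and store pattern can also be written with NEON intrinsics from arm_neon.h instead of inline assembly, which leaves register allocation to the compiler. The following is a sketch, not the code from the question: the function name greyScaleIntrinsics is hypothetical, and the scalar loop is an addition that handles leftover pixels and lets the code build on targets without NEON.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Weighted greyscale, gray = (28*B + 151*G + 77*R) >> 8, eight pixels at a time.
void greyScaleIntrinsics(std::uint8_t* out, const std::uint8_t* bgra, std::size_t pixels)
{
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x8_t wB = vdup_n_u8(28);
    const uint8x8_t wG = vdup_n_u8(151);
    const uint8x8_t wR = vdup_n_u8(77);
    for (; i + 8 <= pixels; i += 8) {
        // One de-interleaving load: val[0] = 8 x B, val[1] = 8 x G, val[2] = 8 x R.
        const uint8x8x4_t px = vld4_u8(bgra + 4 * i);
        uint16x8_t acc = vmull_u8(px.val[0], wB);   // widen to 16 bits and multiply
        acc = vmlal_u8(acc, px.val[1], wG);         // multiply-accumulate
        acc = vmlal_u8(acc, px.val[2], wR);
        vst1_u8(out + i, vshrn_n_u16(acc, 8));      // divide by 256, narrow, store 8 bytes
    }
#endif
    // Scalar path: leftover pixels, or everything when NEON is unavailable.
    for (; i < pixels; ++i) {
        const unsigned b = bgra[4 * i + 0];
        const unsigned g = bgra[4 * i + 1];
        const unsigned r = bgra[4 * i + 2];
        out[i] = static_cast<std::uint8_t>((28u * b + 151u * g + 77u * r) >> 8);
    }
}
```

Unlike the inline-assembly loop, this version does not require the pixel count to be a multiple of 8.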

Regarding iphone - C vs assembler vs NEON performance, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11508172/
