
c++ - Simple SSE loop slower than non-SSE version

Reposted. Author: 太空宇宙. Updated: 2023-11-04 12:05:46

I am trying to compare SSE float[4] addition with standard float[4] addition. As a demonstration, I compute the sum of the summed components, with and without SSE:

#include <iostream>
#include <vector>

struct Point4
{
    Point4()
    {
        data[0] = 0;
        data[1] = 0;
        data[2] = 0;
        data[3] = 0;
    }

    float data[4];
};

void Standard()
{
    Point4 a;
    a.data[0] = 1.0f;
    a.data[1] = 2.0f;
    a.data[2] = 3.0f;
    a.data[3] = 4.0f;

    Point4 b;
    b.data[0] = 1.0f;
    b.data[1] = 6.0f;
    b.data[2] = 3.0f;
    b.data[3] = 5.0f;

    float total = 0.0f;
    for(unsigned int i = 0; i < 1e9; ++i)
    {
        for(unsigned int component = 0; component < 4; ++component)
        {
            total += a.data[component] + b.data[component];
        }
    }

    std::cout << "total: " << total << std::endl;
}

void Vectorized()
{
    typedef float v4sf __attribute__ (( vector_size(4*sizeof(float)) ));

    v4sf a;
    float* aPointer = (float*)&a;
    aPointer[0] = 1.0f; aPointer[1] = 2.0f; aPointer[2] = 3.0f; aPointer[3] = 4.0f;

    v4sf b;
    float* bPointer = (float*)&b;
    bPointer[0] = 1.0f; bPointer[1] = 6.0f; bPointer[2] = 3.0f; bPointer[3] = 5.0f;

    v4sf result;
    float* resultPointer = (float*)&result;
    resultPointer[0] = 0.0f;
    resultPointer[1] = 0.0f;
    resultPointer[2] = 0.0f;
    resultPointer[3] = 0.0f;

    for(unsigned int i = 0; i < 1e9; ++i)
    {
        result += a + b; // Vectorized operation
    }

    // Sum the components of the result (this is done with the "total += " in the Standard() loop)
    float total = 0.0f;
    for(unsigned int component = 0; component < 4; ++component)
    {
        total += resultPointer[component];
    }
    std::cout << "total: " << total << std::endl;
}

int main()
{

    // Standard();

    Vectorized();

    return 0;
}

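However, the code using the standard approach seems to be faster (~0.2 s) than the code using the vectorized approach (~0.4 s). Is that because of the for loop that sums the v4sf values? Is there a better operation I can use to time the difference between the two techniques and still compare the outputs to make sure they do not differ?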

Best answer

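Your SSE version is slower because every iteration has to unpack from the SSE register into scalar registers four times, and that overhead outweighs what the vectorized addition saves. Look at the disassembly and you should get a clearer picture.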

I think what you want to do is the following (which is faster with SSE):

for(unsigned int i = 0; i < 1e6; ++i)
{
    result += a + b; // Vectorized operation
}

// Sum the components of the result (this is done with the "total += " in the Standard() loop)
for(unsigned int component = 0; component < 4; ++component)
{
    total += resultPointer[component];
}
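
As a self-contained sketch of the difference, the two functions below contrast unpacking on every iteration with unpacking once after the loop. It assumes the GCC/Clang vector extension used in the question; the function names and the `check_totals_agree` helper are hypothetical and not part of the original answer.

```cpp
#include <cassert>

// GCC/Clang vector extension, as in the question: four packed floats.
typedef float v4sf __attribute__((vector_size(4 * sizeof(float))));

// Slow pattern: the packed sum is unpacked into scalars on every iteration.
float sum_unpack_every_iteration(v4sf a, v4sf b, unsigned int iterations)
{
    float total = 0.0f;
    for (unsigned int i = 0; i < iterations; ++i)
    {
        v4sf sum = a + b;                            // one packed add
        total += sum[0] + sum[1] + sum[2] + sum[3];  // four lane reads per pass
    }
    return total;
}

// Faster pattern: accumulate in a vector and unpack once after the loop.
float sum_unpack_once(v4sf a, v4sf b, unsigned int iterations)
{
    v4sf result = {0.0f, 0.0f, 0.0f, 0.0f};
    for (unsigned int i = 0; i < iterations; ++i)
    {
        result += a + b;  // packed add only, no lane reads inside the loop
    }
    return result[0] + result[1] + result[2] + result[3];
}

// Hypothetical check: both variants should agree for a small iteration count.
void check_totals_agree()
{
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {1.0f, 6.0f, 3.0f, 5.0f};
    assert(sum_unpack_every_iteration(a, b, 1000) == sum_unpack_once(a, b, 1000));
}
```

With a small iteration count every partial sum here is a small integer that float represents exactly, so an exact comparison is safe. At 1e9 iterations float rounding would likely make the two totals drift apart, so comparing with a tolerance or accumulating in double is probably the better way to check that the outputs match.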

The following may be even faster:

for(unsigned int i = 0; i < 1e6/4; ++i)
{
    result0 += a + b; // Vectorized operation
    result1 += a + b; // Vectorized operation
    result2 += a + b; // Vectorized operation
    result3 += a + b; // Vectorized operation
}

// Sum the components of the result (this is done with the "total += " in the Standard() loop)
for(unsigned int component = 0; component < 4; ++component)
{
    total += resultPointer0[component];
    total += resultPointer1[component];
    total += resultPointer2[component];
    total += resultPointer3[component];
}
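
The unrolled fragment leaves its accumulators undeclared, so here is one way it might be completed. This is a sketch under assumptions: the function name is hypothetical, the iteration count is assumed to be a multiple of 4, and the four accumulators are combined as vectors before the single unpack. The answer does not say why this may be faster; presumably the four independent accumulators shorten the chain of dependent additions.

```cpp
#include <cassert>

// GCC/Clang vector extension, as in the question: four packed floats.
typedef float v4sf __attribute__((vector_size(4 * sizeof(float))));

// Sketch of the unrolled variant with four independent accumulators.
float sum_four_accumulators(v4sf a, v4sf b, unsigned int iterations)
{
    // Assumption: the count is a multiple of 4, so no remainder loop is needed.
    assert(iterations % 4 == 0);

    const v4sf zero = {0.0f, 0.0f, 0.0f, 0.0f};
    v4sf result0 = zero, result1 = zero, result2 = zero, result3 = zero;

    for (unsigned int i = 0; i < iterations / 4; ++i)
    {
        // Each accumulator depends only on its own previous value.
        result0 += a + b;
        result1 += a + b;
        result2 += a + b;
        result3 += a + b;
    }

    // Combine the accumulators as vectors, then unpack once.
    v4sf result = (result0 + result1) + (result2 + result3);
    return result[0] + result[1] + result[2] + result[3];
}
```

Whether this beats the single-accumulator loop depends on the compiler and optimization level, so it is worth timing both and checking the disassembly.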

Regarding c++ - Simple SSE loop slower than non-SSE version, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12186193/
