
c++ - Code does not speed up when using Intel Intrinsics

Reposted. Author: 行者123. Updated: 2023-11-28 05:38:27

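I am using intrinsics to speed up some OpenCV code. After replacing the original code with intrinsics, however, the run time is almost the same, and possibly even worse. I don't know what is going on or why this happens. I have been searching for a solution for a long time without making any progress. I would be very grateful if someone could help. Thanks a lot! Here is my code: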

    // If useSSE is true, run the intrinsics version (takes 1.45 ms on my machine);
    // otherwise run the plain version, which takes the same time.
    cv::Mat_<float> results(shape.rows, 2);
    if (useSSE) {
        float* pshape = (float*)shape.data;
        results = shape.clone();
        float* presults = (float*)results.data;
        // use SSE
        __m128 xyxy_center = _mm_set_ps(bbox.center_y, bbox.center_x, bbox.center_y, bbox.center_x);

        float bbox_width = bbox.width/2;
        float bbox_height = bbox.height/2;
        __m128 xyxy_size = _mm_set_ps(bbox_height, bbox_width, bbox_height, bbox_width);
        gettimeofday(&start, NULL); // this is for counting time

        int shape_size = shape.rows*shape.cols;
        for (int i=0; i<shape_size; i +=4) {
            __m128 a = _mm_loadu_ps(pshape+i);
            __m128 result = _mm_div_ps(_mm_sub_ps(a, xyxy_center), xyxy_size);
            _mm_storeu_ps(presults+i, result);
        }
    } else {
        // SSE TO BE DONE
        for (int i = 0; i < shape.rows; i++) {
            results(i, 0) = (shape(i, 0) - bbox.center_x) / (bbox.width / 2.0);
            results(i, 1) = (shape(i, 1) - bbox.center_y) / (bbox.height / 2.0);
        }
    }
    gettimeofday(&end, NULL);
    diff = 1000000*(end.tv_sec-start.tv_sec)+end.tv_usec-start.tv_usec;
    std::cout<<diff<<"-----"<<std::endl;
    return results;

Best answer

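  1. If shape.rows % 2 == 1, your SSE loop runs past the end of the buffers. With 2 columns, the element count shape.rows*shape.cols is then not a multiple of 4, so the last iteration reads 2 floats beyond shape and writes 2 floats beyond results, corrupting the memory next to the results variable.
  2. Try to avoid the index variable i in the loop and advance the pointers directly. The compiler may or may not optimize away the extra addition.
  3. Replace the division with a multiplication by the precomputed reciprocal: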

    float bbox_width_inv = 2.0f / bbox.width;
    float bbox_height_inv = 2.0f / bbox.height;
    __m128 xyxy_size_inv = _mm_set_ps(bbox_height_inv, bbox_width_inv, bbox_height_inv, bbox_width_inv);
    float* pshape_end = pshape + shape.rows*shape.cols;
    // Round the element count down to a multiple of 4 for the vector loop
    float* pshape_end_batch = pshape + ((shape.rows*shape.cols) & ~3);
    for (; pshape < pshape_end_batch; pshape += 4, presults += 4) {
        __m128 a = _mm_loadu_ps(pshape);
        __m128 result = _mm_mul_ps(_mm_sub_ps(a, xyxy_center), xyxy_size_inv);
        _mm_storeu_ps(presults, result);
    }
    // Scalar tail: handles the leftover (x, y) pair when shape.rows is odd
    while (pshape < pshape_end) {
        *presults++ = (*pshape++ - bbox.center_x) * bbox_width_inv;
        *presults++ = (*pshape++ - bbox.center_y) * bbox_height_inv;
    }
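  4. Try disassembling the code generated from the intrinsics. Make sure there are enough registers for your operations and that temporary results are not being spilled to RAM.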
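
Points 1 to 3 can be put together in a minimal, self-contained sketch. It assumes plain float arrays of interleaved (x, y) pairs instead of cv::Mat_<float>; the BBox struct and the function names normalize_scalar and normalize_sse are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for the question's bbox object.
struct BBox {
    float center_x, center_y, width, height;
};

// Scalar reference: `in` holds interleaved (x, y) pairs, `count` floats in total.
void normalize_scalar(const float* in, float* out, std::size_t count, const BBox& b) {
    assert(count % 2 == 0);
    const float inv_w = 2.0f / b.width;
    const float inv_h = 2.0f / b.height;
    for (std::size_t i = 0; i < count; i += 2) {
        out[i]     = (in[i]     - b.center_x) * inv_w;
        out[i + 1] = (in[i + 1] - b.center_y) * inv_h;
    }
}

// SSE version: 4 floats (two points) per iteration, scalar tail for the rest.
void normalize_sse(const float* in, float* out, std::size_t count, const BBox& b) {
    assert(count % 2 == 0);
    const float inv_w = 2.0f / b.width;
    const float inv_h = 2.0f / b.height;
    // _mm_set_ps takes its arguments from the highest lane to the lowest,
    // so the lanes end up as x, y, x, y in memory order.
    const __m128 center = _mm_set_ps(b.center_y, b.center_x, b.center_y, b.center_x);
    const __m128 inv    = _mm_set_ps(inv_h, inv_w, inv_h, inv_w);

    const float* end = in + count;
    // Largest multiple of 4 that fits: the vector loop never touches memory past it.
    const float* end_batch = in + (count & ~static_cast<std::size_t>(3));
    for (; in < end_batch; in += 4, out += 4) {
        const __m128 a = _mm_loadu_ps(in);
        _mm_storeu_ps(out, _mm_mul_ps(_mm_sub_ps(a, center), inv));
    }
    // At most one (x, y) pair is left over when the number of points is odd.
    while (in < end) {
        *out++ = (*in++ - b.center_x) * inv_w;
        *out++ = (*in++ - b.center_y) * inv_h;
    }
}
```

Calling normalize_sse with 3 points (6 floats) handles the first four floats in the vector loop and the last pair in the scalar tail, so nothing after the sixth float is written. Note that multiplying by a precomputed reciprocal can differ from a true division in the last bit, which is usually acceptable for this kind of normalization.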

Regarding "c++ - Code does not speed up when using Intel Intrinsics", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37699654/
