c - 如何帮助 gcc 向量化 C 代码-6ren

c - 如何帮助 gcc 向量化 C 代码

转载作者：太空狗更新时间：2023-10-29 17:03:28

我有以下 C 代码。第一部分只是将标准输入的复数矩阵读入名为 M 的矩阵中。 .有趣的部分是第二部分。

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

int main() {
    int n, m, c, d;
    float re, im;

    scanf("%d %d", &n, &m);
    assert(n==m);
    complex float M[n][n];

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
    scanf("%f%fi", &re, &im);
    M[c][d] = re + im * I;
      }
    }

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
        printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
      }
      printf("\n");
    }
/*
Example:input   
2 3
1+2i 2+3i 74-4i
3+4i 4+5i -7-8i
*/
    /* Part 2. M is now an n by n matrix of complex numbers */
    int s=1, i, j;
    int *f = malloc(n * sizeof *f);
    complex float *delta = malloc(n * sizeof *delta);
    complex float *v = malloc(n * sizeof *v);
    complex float p = 1, prod;

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      f[i] = i;
      delta[i] = 1;
    }
    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }
    free(delta);
    free(f);
    free(v);
    printf("%f + i%f\n", creal(p/pow(2.,(n-1))), cimag(p/pow(2.,(n-1))));
    return 0;
}

我用 gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm 编译.这向我解释了为什么几乎没有循环被矢量化。

性能最重要的部分是第 47--50 行，它们是:

for (i = 0; i < n; i++) {
    v[i] -= 2.*delta[j]*M[j][i];
    prod *= v[i];
}

gcc 告诉我:

permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: Unsupported pattern.
permanent-in-c.c:47:7: note: not vectorized: unsupported use in stmt.
permanent-in-c.c:47:7: note: unexpected pattern.
[...]
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: IMAGPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: REALPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
[...]
permanent-in-c.c:48:26: note: Build SLP failed: unrolling required in basic block SLP
permanent-in-c.c:48:26: note: Failed to SLP the basic block.
permanent-in-c.c:48:26: note: not vectorized: failed to find SLP opportunities in basic block.

How can I fix the problems that are stopping this part from being vectorized?

奇怪的是，这部分是矢量化的，但我不知道为什么:

for (j = 0; j <n; j++) {
    v[i] += M[j][i];

gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 Permanent-in-c.c -lm 的完整输出位于 https://bpaste.net/show/18ebc3d66a53 .

最佳答案

让我们首先详细检查代码。我们有

complex float  M[rows][cols];
complex float  v[cols];
float          delta[rows];
complex float  p = 1.0f;
float          s = 1.0f;

虽然 delta[]是 complex float输入 OP 的代码，它只包含 -1.0f或 +1.0f . (此外，如果是 -2.0f 或 +2.0f，则可以简化计算。)因此，我定义为实数而不是复杂的。

同样，OP 定义了 s如 int ，但有效地使用它作为 -1.0f和 +1.0f仅(在计算中)。这就是为什么我将其明确定义为 float .

我省略了 f数组，因为有一种简单的方法可以完全避免它。

代码有趣部分的第一个循环，

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      delta[i] = 1;
    }

执行几件事。它初始化 delta[] 中的所有元素数组为 1；它可以(并且可能应该)拆分为一个单独的循环。

由于外循环在 i 中增加, p将是 v 中元素的乘积;它也可以分成一个单独的循环。

因为内部循环对 i 列中的所有元素求和至 v[i] ，外循环和内循环简单地将每一行作为 vector 相加到 vector v .

因此，我们可以用伪代码重写上面的内容，如

Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v

p = product of complex elements in vector v

delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f

让我们接下来看看第二个嵌套循环:

    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }

除非您检查 j 的值，否则很难看到随着循环的进行，但外循环体的最后 4 行实现了 OEIS A007814 j 中的整数序列(0,1,0,2,0,1,0,3,0,1,0,2,0,1,0,4,...)。此循环中的迭代次数为 2rows-1-1。这部分序列是对称的，并实现了一个高度为rows-1的二叉树:

               4
       3               3
   2       2       2       2     (Read horizontally)
 1   1   1   1   1   1   1   1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

事实证明，如果我们循环遍历 i = 1 .. 2rows-1，然后 r是 i 中零低位的数量. GCC 提供了一个 __builtin_ctz() 内置函数，它正是计算这个。 (请注意， __builtin_ctz(0) 会产生一个未定义的值；所以不要这样做，即使它碰巧在您的计算机上产生了一个特定的值。)

内循环减去行 j 上的复数值的矩阵，按 2*delta[j] 缩放，来自 vector v[] .它还计算 vector v[] 中复杂条目的乘积(减法后)转化为变量 prod .

内循环后， delta[j]被否定，比例因子也是如此 s .变量值 prod ，按 s 缩放, 添加到 p .

循环后，最终(复杂)结果为 p除以 2rows-1。使用 ldexp() 可以更好地完成此操作C99 函数(实部和复部分开)。

因此，我们可以用伪代码重写第二个嵌套循环，如

s = 1.0f

For k = 1 .. rows-1, inclusive:
    r = __builtin_ctz(k), i.e. number of least
                          significant bits that
                          are zero in k

    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]

    prod = the product of (complex) elements in vector v[]

    Negate scale factor s (changing its sign)

    Add prod times s to result p

根据我的经验，最好将实部和虚部拆分为单独的 vector 和矩阵。考虑以下定义:

typedef struct {
    float  *real;
    float  *imag;
    size_t  floats_per_row;  /* Often 'stride' */
    size_t  rows;
    size_t  cols;
} complex_float_matrix;

/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);

/* Copy an array of floats */
void float_copy(float *, const float *, size_t);

/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);

/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);

/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);

/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);

以上都是很容易矢量化的，只要 float_malloc()产生足够对齐的指针(并且编译器被告知，例如通过 GCC 函数属性 __attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR))); )和 floats_per_row矩阵中的成员是 vector 中浮点数的倍数。

(我不知道 GCC 是否可以自动向量化上述函数，但我知道可以使用 GCC vector 扩展“手动”向量化它们。)

有了上面，代码的整个有趣部分，在伪 C 中，变成

complex float permanent(const complex_float_matrix *m)
{
    float         *v_real, *v_imag;
    float         *scale;   /* OP used 'delta' */
    complex float  result;  /* OP used 'p'     */
    complex float  product; /* OP used 'prod'  */
    float          coeff = 1.0f; /* OP used 's' */
    size_t         i = 1 << (m->rows - 1);
    size_t         r;

    if (!m || m->cols < 1 || m->rows < 1 || !i) {
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale  = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */

    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);

    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);

        scale[r] = -scale[r];

        float_add_scaled(v_real, m->real + r * m->floats_per_row, m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, m->cols);

        product = complex_float_product(v_real, v_imag, m->cols);

        coeff = -coeff;

        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    return result;
}

在这一点上，我会在没有矢量化的情况下亲自实现上述内容，然后对其进行广泛的测试，直到我确定它可以正常工作。

然后，我会检查 GCC 程序集输出 ( -S ) 以查看它是否可以充分矢量化各个操作(我之前列出的函数)。

使用 GCC vector 扩展手动向量化函数非常简单。声明一个浮点 vector 很简单:

typedef  float  vec2f __attribute__((vector_size (8), aligned (8)));  /* 64 bits; MMX, 3DNow! */
typedef  float  vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef  float  vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef  float  vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */

每个 vector 中的各个分量可以使用数组表示法( v[0] 和 v[1] 用于 vec2f v; )进行寻址。 GCC 可以按元素对整个 vector 进行基本操作；我们在这里只需要加法和乘法。应该避免水平操作(应用于同一 vector 中元素之间的操作)，而是对元素进行重新排序。

即使在没有这种矢量化的体系结构上，GCC 也会为上述 vector 大小生成工作代码，但生成的代码可能很慢。 (至少 5.4 之前的 GCC 版本会产生很多不必要的移动，通常是堆叠和返回。)

为 vector 分配的内存需要充分对齐。 malloc()并非在所有情况下都提供足够对齐的内存；你应该使用 posix_memalign() 反而。 aligned属性可用于增加 GCC 用于 vector 类型的对齐方式，在本地或静态分配时。在矩阵中，您需要确保行从足够对齐的边界开始；这就是我拥有 floats_per_row 的原因结构中的变量。

如果 vector (或行)中的元素数量很大，但不是 vector 中浮点数的倍数，则应使用“惰性”值填充 vector ——不影响结果的值，喜欢 0.0f用于加法和减法，以及 1.0f为乘法。

至少在 x86 和 x86-64 上，GCC 将为仅使用指针的循环生成更好的代码。例如，这

void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}

产生比

void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}

或者

void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}

(这往往会产生类似的代码)。就个人而言，我会将其实现为

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t        val;

        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t        val;

        __builtin_memcpy(&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    }
}

和 float_copy()作为

void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        uint64_t       *src = (uint64_t *)__builtin_assume_aligned((void *)source, 8);

        while (ptr < end)
            *(ptr++) = *(src++);

    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t       *src = (uint32_t *)__builtin_assume_aligned((void *)source, 4);

        while (ptr < end)
            *(ptr++) = *(src++);
    }
}

或接近的东西。

最难矢量化的是 complex_float_product() .如果用 1.0f 填充最终 vector 中未使用的元素为实部和 0.0f对于虚部，您可以轻松计算每个 vector 的复数乘积。请记住

(a + b i) × (c + d i) = (a c - b d) + (a d + b c) i

这里的难点是有效地计算 vector 中元素的复数乘积。幸运的是，这部分对整体性能来说根本不是关键(除了非常短的 vector ，或者列很少的矩阵)，所以在实践中应该没有太大关系。

(简而言之，“难”的部分是找到重新排序元素以最大限度地使用压缩 vector 乘法的方法，而不需要这么多的洗牌/移动来减慢它的速度。)

对于 float_add_scaled()函数，您应该创建一个填充比例因子的 vector ；类似以下内容，

void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f  coeff = { scale, scale, scale, scale };
    vec4f       *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (vec4f *)__builtin_assume_aligned((void *)source, 16);

    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}

如果我们忽略对齐和大小检查以及回退实现。

关于c - 如何帮助 gcc 向量化 C 代码，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41639654/

文章推荐： asp.net-mvc - ASP.NET MVC 以小写形式呈现 html 标签和属性

文章推荐： python - cython中def、cdef、cpdef的定义

文章推荐： angular - Observable.subscribe() 只工作一次 (ngrx\angular2)

文章推荐： python - 有没有办法在 Python Selenium 中通过属性查找元素？

c++ - C c;之间有什么区别吗？和 C c = C();?
#include using namespace std; class C{ private: int value; public: C(){ value = 0;
c++ - C 风格字符串差异 : C/C++
这个问题已经有答案了: What is the difference between char a[] = ?string?; and char *p = ?string?;? (8 个回答) 已关闭
c++ - c\c++ 转换为 C#
关闭。此题需要details or clarity 。目前不接受答案。想要改进这个问题吗？通过 editing this post 添加详细信息并澄清问题. 已关闭 7 年前。此帖子已于 8 个月
c# - C、C++、C# 的功能测试工具
除了调试之外，是否有任何针对 c、c++ 或 c# 的测试工具，其工作原理类似于将独立函数复制粘贴到某个文本框，然后在其他文本框中输入参数？最佳答案也许您会考虑单元测试。我推荐你谷歌测试和谷歌模拟
c# - C/C++/C# 在监视器上设置窗口位置
我想在第二台显示器中移动一个窗口 (HWND)。问题是我尝试了很多方法，例如将分辨率加倍或输入负值，但它永远无法将窗口放在我的第二台显示器上。关于如何在 C/C++/c# 中执行此操作的任何线索最
c# - C/C++/C#中的DES实现
我正在寻找 C/C++/C## 中不同类型 DES 的现有实现。我的运行平台是Windows XP/Vista/7。我正在尝试编写一个 C# 程序，它将使用 DES 算法进行加密和解密。我需要一些实
c# - 在条件中使用赋值是否安全？ C/C++、C#
很难说出这里要问什么。这个问题模棱两可、含糊不清、不完整、过于宽泛或夸夸其谈，无法以目前的形式得到合理的回答。如需帮助澄清此问题以便重新打开，visit the help center . 关闭 1
c++ - C/C++/C# 强制窗口在最上面
有没有办法强制将另一个窗口置于顶部？不是应用程序的窗口，而是另一个已经在系统上运行的窗口。 (Windows, C/C++/C#) 最佳答案 SetWindowPos(that_window_ha
c# - 套接字服务器应用程序的选择 : C/C++ or C#
假设您可以在 C/C++ 或 Csharp 之间做出选择，并且您打算在 Windows 和 Linux 服务器上运行同一服务器的多个实例，那么构建套接字服务器应用程序的最明智选择是什么？最佳答案如
c++ - C/C++ 运行时库和 C/C++ 标准库的区别
你们能告诉我它们之间的区别吗？顺便问一下，有什么叫C++库或C库的吗？最佳答案 C++ 标准库和 C 标准库是 C++ 和 C 标准定义的库，提供给 C++ 和 C 程序使用。那是那些词的共同
c++ - &C::c 和 &(C::c) 有什么区别？
下面的测试代码，我将输出信息放在注释中。我使用的是 gcc 4.8.5 和 Centos 7.2。 #include #include class C { public:
c++ - 什么 C++(通用 (c/c++) 与 (通用 c)/c++ )
很难说出这里问的是什么。这个问题是含糊的、模糊的、不完整的、过于宽泛的或修辞性的，无法以目前的形式得到合理的回答。如需帮助澄清此问题以便重新打开它，visit the help center 。已关
c# - 通过网络在 C/C++ 服务器、C/C++ 和 C# 客户端之间发送数据结构
我的客户将使用名为 annoucement 的结构/类与客户通信。我想我会用 C++ 编写服务器。会有很多不同的类继承annoucement。我的问题是通过网络将这些类发送给客户端我想也许我应该使用
c# - C/C++ - 如何将 Buffer.BlockCopy (C#) 转换为 C/C++
我在 C# 中有以下函数: public Matrix ConcatDescriptors(IList> descriptors) { int cols = descriptors[0].Co
c++ - C/C++ - 对其他人隐藏 C 或 C++ 函数代码
我有一个项目要编写一个函数来对某些数据执行某些操作。我可以用 C/C++ 编写代码，但我不想与雇主共享该函数的代码。相反，我只想让他有权在他自己的代码中调用该函数。是否可以？我想到了这两种方法 - 在
c# - 在托管代码(C++、C、C++/CLI、C#)中使用非托管代码时处理错误
我使用的是编写糟糕的第 3 方 (C/C++) Api。我从托管代码(C++/CLI)中使用它。有时会出现“访问冲突错误”。这使整个应用程序崩溃。我知道我无法处理这些错误[如果指针访问非法内存位置等，
c# - C#、C/C++ 或 Objective-C 中的眼动追踪库
关闭。这个问题不符合Stack Overflow guidelines .它目前不接受答案。我们不允许提问寻求书籍、工具、软件库等的推荐。您可以编辑问题，以便用事实和引用来回答。关闭 7 年前。
c++ - C/C++/Objective-C 文本识别库
已关闭。此问题不符合Stack Overflow guidelines 。目前不接受答案。要求我们推荐或查找工具、库或最喜欢的场外资源的问题对于 Stack Overflow 来说是偏离主题的，因为
c# - 将 C/C++ 函数导入 C#
我有一些 C 代码，将使用 P/Invoke 从 C# 调用。我正在尝试为这个 C 函数定义一个 C# 等效项。 SomeData* DoSomething(); struct SomeData {
c - C语言中 "c -= --c - c++;"的结果应该是什么？
这个问题已经有答案了: Why are these constructs using pre and post-increment undefined behavior? (14 个回答) 已关闭 6

太空狗

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c - 如何帮助 gcc 向量化 C 代码