c - 在 L1 缓存 : only getting 62% 中获取 Haswell 上的峰值带宽-6ren

c - 在 L1 缓存 : only getting 62% 中获取 Haswell 上的峰值带宽

转载作者：行者123 更新时间：2023-12-01 18:35:36

我正在尝试在 L1 缓存中为英特尔处理器上的以下功能获得全部带宽

float triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    for(int i=0; i<n; i++) {
        z[i] = x[i] + k*y[i];
    }
}

这是来自 STREAM 的三元组函数.

使用具有此功能的 SandyBridge/IvyBridge 处理器(使用 NASM 汇编)，我获得了大约 95% 的峰值。但是，除非展开循环，否则使用 Haswell 我只能达到峰值的 62%。如果我展开 16 次，我会得到 92%。我不明白这个。

我决定使用 NASM 在汇编中编写我的函数。汇编中的主循环如下所示。

.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2

原来在 Agner Fog's Optimizing Assembly manual在示例 12.7-12.11 中，他为 Pentium M、Core 2、Sandy Bridge、FMA4 和 FMA3 做了几乎相同的事情(但对于 y[i] = y[i] +k*x[i])。我设法或多或少地自己重现了他的代码(实际上他在广播时在 FMA3 示例中有一个小错误)。除了 FMA4 和 FMA3 之外，他在表中给出了每个处理器的指令大小计数、融合操作、执行端口。我曾尝试为 FMA3 自己制作这张 table 。

                                 ports
             size   μops-fused   0   1   2   3   4   5   6   7    
vmovaps      5      1                    ½   ½
vfmadd231ps  6      1            ½   ½   ½   ½
vmovaps      5      1                            1           1
add          4      ½                                    ½
jne          2      ½                                    ½
--------------------------------------------------------------
total       22      4            ½   ½   1   1   1   0   1   1

大小是指以字节为单位的指令长度。原因 add和 jne指令有半个 μop 是它们融合成一个宏操作(不要与仍然使用多个端口的 μop 融合混淆)并且只需要端口 6 和一个 μop。 vfmadd231ps指令可以使用端口0或端口1。我选择了端口0。负载 vmovaps可以使用端口 2 或 3。我选择了 2 并且有 vfmadd231ps使用端口 3 .. 为了与 Agner Fog 的表保持一致，并且因为我认为说一条可以去不同端口的指令平均分配给每个 1/2 的时间更有意义，我分配了 1/2 用于端口 vmovaps和 vmadd231ps可以去。

基于此表以及所有 Core2 处理器每个时钟周期可以执行 4 μop 的事实，似乎每个时钟周期都应该可以执行此循环，但我还没有设法获得它。 有人可以向我解释为什么我无法在不展开的情况下接近 Haswell 上此功能的峰值带宽吗？这是否可以不展开，如果可以，怎么做？ 让我明确一点，我真的想最大化这个函数的 ILP(我不仅想要最大带宽)，所以这就是我不想展开的原因。

编辑:
这是一个更新，因为 Iwillnotexist Idonotexist 使用 IACA 显示商店从不使用端口 7。我设法在没有展开的情况下打破了 66% 的障碍，并且在没有展开的情况下每次迭代在一个时钟周期内执行此操作(理论上)。让我们首先解决商店问题。

Stephen Canon 在评论中提到，端口 7 中的地址生成单元 (AGU) 只能处理诸如 [base + offset] 之类的简单操作。而不是 [base + index] .在 Intel optimization reference manual我发现的唯一一件事是对 port7 的评论，上面写着“Simple_AGU”，但没有定义简单的含义。但是后来在 IACA的评论中发现了Iwillnotexist Idonotexist六个月前，英特尔的一名员工在 2014 年 3 月 11 日写道:

Port7 AGU can only work on stores with simple memory address (no index register).

Stephen Canon 建议“使用存储地址作为加载操作数的偏移量”。我试过这样

vmovaps         ymm1, [rdi + r9 + 32*i]
vfmadd231ps     ymm1, ymm2, [rsi + r9 + 32*i]
vmovaps         [r9 + 32*i], ymm1
add             r9, 32*unroll
cmp             r9, rcx
jne             .L2

这确实导致商店使用端口 7。但是，它还有另一个问题，即 vmadd231ps不与您可以从 IACA 看到的负载熔断。它还需要额外的 cmp我原来的功能没有的指令。所以商店少用了一个微操作，但 cmp (或者更确切地说是 add 因为 cmp 宏与 jne 融合)还需要一个。 IACA 报告的块吞吐量为 1.5。在实践中，这只能达到峰值的 57%。

但我找到了获取 vmadd231ps 的方法指令也与负载熔断。这只能使用静态数组来完成，像这样寻址 [绝对 32 位地址 + 索引]。 Evgeny Kluev original suggested this .

vmovaps         ymm1, [src1_end + rax]
vfmadd231ps     ymm1, ymm2, [src2_end + rax]
vmovaps         [dst_end + rax], ymm1
add             rax, 32
jl              .L2

哪里 src1_end , src2_end , 和 dst_end是静态数组的结束地址。

这用我预期的四个融合的微操作重现了我的问题中的表格。 如果将其放入 IACA，它会报告 1.0 的块吞吐量。从理论上讲，这应该与 SSE 和 AVX 版本一样好。在实践中，它大约达到峰值的 72%。这打破了 66% 的障碍，但距离我展开 16 次的 92% 仍有很长的路要走。因此，在 Haswell 上，接近顶峰的唯一选择是展开。这在通过 Ivy Bridge 的 Core2 上不是必需的，但在 Haswell 上是必需的。

End_edit:

这是用于测试的 C/C++ Linux 代码。 NASM 代码张贴在 C/C++ 代码之后。您唯一需要更改的是频率数。在线留言 double frequency = 1.3;将 1.3 替换为处理器的任何运行(非标称)频率(如果 i5-4250U 在 BIOS 中禁用涡轮增压，则为 1.3 GHz)。

编译

nasm -f elf64 triad_sse_asm.asm
nasm -f elf64 triad_avx_asm.asm
nasm -f elf64 triad_fma_asm.asm
g++ -m64 -lrt -O3 -mfma  tests.cpp triad_fma_asm.o -o tests_fma
g++ -m64 -lrt -O3 -mavx  tests.cpp triad_avx_asm.o -o tests_avx
g++ -m64 -lrt -O3 -msse2 tests.cpp triad_sse_asm.o -o tests_sse

C/C++ 代码

#include <x86intrin.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TIMER_TYPE CLOCK_REALTIME

extern "C" float triad_sse_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_sse_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);    
extern "C" float triad_avx_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_avx_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat); 
extern "C" float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
extern "C" float triad_fma_asm_repeat_unroll16(float *x, float *y, float *z, const int n, int repeat);

#if (defined(__FMA__))
float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_fmadd_ps(k4, _mm256_load_ps(&y[i]), _mm256_load_ps(&x[i])));
        }
    }
}
#elif (defined(__AVX__))
float triad_avx_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}
#else
float triad_sse_repeat(float *x, float *y, float *z, const int n, int repeat) {
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m128 k4 = _mm_set1_ps(k);
        for(i=0; i<n; i+=4) {
            _mm_store_ps(&z[i], _mm_add_ps(_mm_load_ps(&x[i]), _mm_mul_ps(k4, _mm_load_ps(&y[i]))));
        }
    }
}
#endif

double time_diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return (double)temp.tv_sec +  (double)temp.tv_nsec*1E-9;
}

int main () {
    int bytes_per_cycle = 0;
    double frequency = 1.3;  //Haswell
    //double frequency = 3.6;  //IB
    //double frequency = 2.66;  //Core2
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n =2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;
    timespec time1, time2;
    #if (defined(__FMA__))
    triad_fma_repeat(x,y,z2,n,repeat);
    #elif (defined(__AVX__))
    triad_avx_repeat(x,y,z2,n,repeat);
    #else
    triad_sse_repeat(x,y,z2,n,repeat);
    #endif

    while(1) {
        double dtime, rate;

        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll1     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_repeat(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_repeat(x,y,z,n,repeat);
        #else
        triad_sse_repeat(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("intrinsic   rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
        clock_gettime(TIMER_TYPE, &time1);
        #if (defined(__FMA__))
        triad_fma_asm_repeat_unroll16(x,y,z,n,repeat);
        #elif (defined(__AVX__))
        triad_avx_asm_repeat_unroll16(x,y,z,n,repeat);
        #else
        triad_sse_asm_repeat_unroll16(x,y,z,n,repeat);
        #endif
        clock_gettime(TIMER_TYPE, &time2);
        dtime = time_diff(time1,time2);
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("unroll16    rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));
    }
}

使用 System V AMD64 ABI 的 NASM 代码。

triad_fma_asm.asm:

global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159
;align 16
section .text
    triad_fma_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmovaps         ymm1, [rdi+rax]
    vfmadd231ps     ymm1, ymm2, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_asm_repeat_unroll16
section .text
    triad_fma_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 32
    %assign i 0 
    %rep    unroll
        vmovaps         ymm1, [r9 + 32*i]
        vfmadd231ps     ymm1, ymm2, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_ava_asm.asm:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat2:
    shl             rcx, 2  
    vbroadcastss    ymm2, [rel pi]

align 16
.L1:
    xor             rax, rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             eax, 32
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_avx_asm_repeat_unroll16
align 16
section .text
    triad_avx_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    vbroadcastss    ymm2, [rel pi]  
align 16
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
align 16
.L2:
    %assign unroll 16
    %assign i 0 
    %rep    unroll
        vmulps          ymm1, ymm2, [r9 + 32*i]
        vaddps          ymm1, ymm1, [r10 + 32*i]
        vmovaps         [r11 + 32*i], ymm1
    %assign i i+1 
    %endrep
    add             r9, 32*unroll
    add             r10, 32*unroll
    add             r11, 32*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

triad_sse_asm.asm:

global triad_sse_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
    ;neg                rcx 
align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             rax, 16
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

global triad_sse_asm_repeat2
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;pi: dd 3.14159
;align 16
section .text
    triad_sse_asm_repeat2:
    shl             rcx, 2  
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
align 16
.L1:
    xor             rax, rax
align 16
.L2:
    movaps          xmm1, [rdi+rax]
    mulps           xmm1, xmm2
    addps           xmm1, [rsi+rax]
    movaps          [rdx+rax], xmm1
    add             eax, 16
    cmp             eax, ecx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret



global triad_sse_asm_repeat_unroll16
section .text
    triad_sse_asm_repeat_unroll16:
    shl             rcx, 2
    add             rcx, rdi
    movss           xmm2, [rel pi]
    shufps          xmm2, xmm2, 0
.L1:
    xor             rax, rax
    mov             r9, rdi
    mov             r10, rsi
    mov             r11, rdx
.L2:
    %assign unroll 8
    %assign i 0 
    %rep    unroll
        movaps          xmm1, [r9 + 16*i]
        mulps           xmm1, xmm2,
        addps           xmm1, [r10 + 16*i]
        movaps          [r11 + 16*i], xmm1
    %assign i i+1 
    %endrep
    add             r9, 16*unroll
    add             r10, 16*unroll
    add             r11, 16*unroll
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    ret

最佳答案

IACA分析

使用 IACA (the Intel Architecture Code Analyzer)揭示宏操作融合确实发生了，这不是问题。正确的是神秘主义者:问题是商店根本不使用端口 7 .

IACA 报告如下:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.5  | 1.5    1.0  | 1.5    1.0  | 1.0  | 0.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
|   2    | 0.5       | 0.5 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rax, 0x20
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffec
Total Num Of Uops: 6

特别是，报告的周期 (1.5) 中的块吞吐量非常好，效率为 66%。

IACA's own website 上的帖子关于这个现象在 Tue, 03/11/2014 - 12:39一位英特尔员工在 Tue, 03/11/2014 - 23:20 上回复了这条回复。 :

Port7 AGU can only work on stores with simple memory address (no index register). This is why the above analysis doesn't show port7 utilization.

这坚定地解决了为什么没有使用端口 7。

现在，对比上面的 32x 展开循环(事实证明 unroll16 实际上应该被称为 unroll32 ):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 32.00 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA, Port4, Port7

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 16.0   0.0  | 16.0 | 32.0   32.0 | 32.0   32.0 | 32.0 | 2.0  | 2.0  | 32.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x20]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x40]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x40], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x60]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x60], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x80]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x80], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xa0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xa0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xc0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xc0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0xe0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0xe0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x100]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x100]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x100], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x120]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x120]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x120], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x140]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x140]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x140], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x160]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x160]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x160], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x180]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x180]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x180], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x1e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x1e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x1e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x200]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x200]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x200], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x220]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x220]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x220], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x240]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x240]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x240], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x260]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x260]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x260], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x280]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x280]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x280], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x2e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x2e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x2e0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x300]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x300]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x300], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x320]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x320]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x320], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x340]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x340]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x340], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x360]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x360]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x360], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x380]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x380]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x380], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3a0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3a0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3a0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3c0]
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3c0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3c0], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovaps ymm1, ymmword ptr [r9+0x3e0]
|   2^   |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [r10+0x3e0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 | CP | vmovaps ymmword ptr [r11+0x3e0], ymm1
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r9, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | add r10, 0x400
|   1    |           |     |           |           |     | 1.0 |     |     |    | add r11, 0x400
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp r9, rcx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffcaf
Total Num Of Uops: 164

我们在这里看到了商店到端口 7 的微融合和正确调度。

手动分析(见上面的编辑)

我现在可以回答你的第二个问题: 这是否可以不展开，如果可以，怎么做？ .答案是不。

我填充了数组 x , y和 z为下面的实验提供了足够的缓冲区，并将内部循环更改为以下内容:

.L2:
vmovaps         ymm1, [rdi+rax] ; 1L
vmovaps         ymm0, [rsi+rax] ; 2L
vmovaps         [rdx+rax], ymm2 ; S1
add             rax, 32         ; ADD
jne             .L2             ; JMP

这故意不使用 FMA(仅加载和存储)并且所有加载/存储指令都没有依赖性，因为因此应该没有任何危险，无论阻止它们进入任何执行端口。

然后我测试了第一次和第二次加载( 1L 和 2L )、存储( S1 )和添加( A )的每一个排列，同时在最后留下条件跳转( J ) ，并且对于这些中的每一个，我测试了 x 的所有可能的偏移组合。 , y和 z按 0 或 -32 字节(纠正在 add rax, 32 索引之一之前重新排序 r+r 会导致加载或存储定位到错误地址的事实)。循环对齐到 32 个字节。测试在 2.4GHz i7-4700MQ 上运行，通过 echo '0' > /sys/devices/system/cpu/cpufreq/boost 禁用 TurboBoost在 Linux 下，频率常数使用 2.4。以下是效率结果(最多 24 个):

Cases: 0           1           2           3           4           5           6           7
       L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   L1  L2  S   
       -0  -0  -0  -0  -0  -32 -0  -32 -0  -0  -32 -32 -32 -0  -0  -32 -0  -32 -32 -32 -0  -32 -32 -32
       ________________________________________________________________________________________________
12SAJ: 65.34%      65.34%      49.63%      65.07%      49.70%      65.05%      49.22%      65.07%
12ASJ: 48.59%      64.48%      48.74%      49.69%      48.75%      49.69%      48.99%      48.60%
1A2SJ: 49.69%      64.77%      48.67%      64.06%      49.69%      49.69%      48.94%      49.69%
1AS2J: 48.61%      64.66%      48.73%      49.71%      48.77%      49.69%      49.05%      48.74%
1S2AJ: 49.66%      65.13%      49.49%      49.66%      48.96%      64.82%      49.02%      49.66%
1SA2J: 64.44%      64.69%      49.69%      64.34%      49.69%      64.41%      48.75%      64.14%
21SAJ: 65.33%*     65.34%      49.70%      65.06%      49.62%      65.07%      49.22%      65.04%
21ASJ: Hypothetically =12ASJ
2A1SJ: Hypothetically =1A2SJ
2AS1J: Hypothetically =1AS2J
2S1AJ: Hypothetically =1S2AJ
2SA1J: Hypothetically =1SA2J
S21AJ: 48.91%      65.19%      49.04%      49.72%      49.12%      49.63%      49.21%      48.95%
S2A1J: Hypothetically =S1A2J
SA21J: Hypothetically =SA12J
SA12J: 64.69%      64.93%      49.70%      64.66%      49.69%      64.27%      48.71%      64.56%
S12AJ: 48.90%      65.20%      49.12%      49.63%      49.03%      49.70%      49.21%*     48.94%
S1A2J: 49.69%      64.74%      48.65%      64.48%      49.43%      49.69%      48.66%      49.69%
A2S1J: Hypothetically =A1S2J
A21SJ: Hypothetically =A12SJ
A12SJ: 64.62%      64.45%      49.69%      64.57%      49.69%      64.45%      48.58%      63.99%
A1S2J: 49.72%      64.69%      49.72%      49.72%      48.67%      64.46%      48.95%      49.72%
AS21J: Hypothetically =AS21J
AS12J: 48.71%      64.53%      48.76%      49.69%      48.76%      49.74%      48.93%      48.69%

我们可以从表中注意到一些事情:

结果有几个高原，但只有两个主要的:不到 50% 和大约 65%。

L1 和 L2 可以在彼此之间自由置换而不影响结果。

将访问偏移 -32 字节可以改变效率。

我们感兴趣的模式(加载 1、加载 2、存储 1 和在它们周围任意位置添加添加和正确应用 -32 偏移)都是相同的，并且都在更高的平台上:

12SAJ案例 0(未应用偏移)，效率为 65.34%(最高)

12ASJ案例 1 ( S-32 )，效率为 64.48%

1A2SJ案例 3 ( 2L-32 , S-32 )，效率为 64.06%

A12SJ案例 7 ( 1L-32 , 2L-32 , S-32 )，效率为 63.99%

对于允许在更高的效率平台上执行的每个排列，总是存在至少一个“案例”。特别是，案例 1(其中 S-32 )似乎可以保证这一点。

案例 2、4 和 6 保证在较低的平台上执行。它们的共同点是，其中一个或两个负载都偏移了 -32，而存储则没有。

对于情况 0、3、5 和 7，这取决于排列。

由此我们至少可以得出几个结论:

执行端口 2 和 3 真的不关心它们生成和加载的加载地址。

宏操作融合add和 jmp似乎不受任何指令排列的影响(特别是在案例 1 抵消下)，这让我相信@Evgeny Kluev 的结论是不正确的:add 的距离来自 jne似乎不会影响它们的融合。我现在有理由确定 Haswell ROB 正确处理了这个问题。

Evgeny 所看到的(从 12SAJ 效率为 65% 到其他情况，在案例 0 中效率为 49%)只是由于加载和存储的地址的值造成的影响，而不是由于核心的无能宏融合添加和分支。

此外，宏操作融合必须至少在某些时间发生，因为平均循环时间为 1.5 CC。如果宏操作融合没有发生，这将是 2CC 最小值。

在未展开的循环中测试了所有有效和无效的指令排列后，我们没有看到高于 65.34% 的情况。这凭经验回答了是否可以在不展开的情况下使用全带宽的问题。

我将假设几种可能的解释:

由于地址相对于彼此的值(value)，我们看到了一些奇怪的变态。

如果是这样，那么将存在一组 x 的偏移量。 , y和 z这将允许最大吞吐量。我的快速随机测试似乎不支持这一点。

我们看到循环以一两步模式运行；循环迭代在一个时钟周期内交替运行，然后是两个。

这可能是受解码器影响的宏操作融合。来自阿格纳雾:

无法在 Sandy Bridge 和 Ivy Bridge 处理器上的四个解码器中的最后一个解码可融合算术/逻辑指令。我还没有测试这是否也适用于 Haswell。

或者，每隔一个时钟周期就会向“错误”端口发出一条指令，从而阻止下一次迭代一个额外的时钟周期。这种情况会在下一个时钟周期内自我纠正，但会保持振荡。

如果有人可以访问英特尔性能计数器，他应该查看事件 UOPS_EXECUTED_PORT.PORT_[0-7] .如果没有发生振荡，所有使用的端口将在相关的时间段内被平等地固定；否则，如果发生振荡，将有 50% 的 split 。尤其重要的是查看 Mystical 指出的端口(0、1、6 和 7)。

这就是我认为没有发生的事情:

我不相信融合算术+分支 uop 会通过转到端口 0 来阻止执行，因为预测采取的分支仅发送到端口 6(请参阅 Haswell -> Control transfer instructions 下的 Agner Fog 指令表)。在上述循环的几次迭代之后，分支预测器将知道这个分支是一个循环，并且总是按所采取的方式进行预测。

我相信这是一个可以用英特尔的性能计数器解决的问题。

关于c - 在 L1 缓存 : only getting 62% 中获取 Haswell 上的峰值带宽，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/25899395/

文章推荐： java - 避免算法中的浮点精度误差

文章推荐： java - 使用 FileWriter 在 java 中覆盖文件

文章推荐： ios - 调整 UIView 的约束。迅速

intel - Haswell 微架构在性能中没有停滞周期后端
我在 Haswell CPU(Intel Core i7-4790)上安装了 perf。但“性能列表”不包括“stalled-cycles-frontend”或“stalled-cycles-back
sse - 使用 Haswell 架构的并行编程
关闭。这个问题不满足Stack Overflow guidelines .它目前不接受答案。想改善这个问题吗？更新问题，使其成为 on-topic对于堆栈溢出。 7年前关闭。 Improve thi
simd - Haswell FMA 指令生成异常
我正在使用 Intel Haswell CPU 的 FMA 指令来优化一些计算。但是，我发现即使我将 MXCSR 寄存器设置为 DNZ 和 FTZ 模式，这些指令也会生成异常。我如何强制这些 FM
compiler-construction - 目前哪些编译器支持 Haswell 事务内存？
哪些编译器(截至 2014 年 5 月)能够生成使用事务内存功能(受限事务内存，而不仅仅是锁省略)的代码？最佳答案 GCC，截至 version 4.8支持英特尔 RTM: Support for
intel - 了解 CYCLE_ACTIVITY.* Haswell 性能监控事件
我正在尝试使用自上而下的微架构分析方法 (TMAM) 来分析 Intel Haswell CPU (Intel® Core™ i7-4900MQ) 上的执行情况，如 Intel® 64 and IA-
x86 - Haswell 内核可以同时执行多少个 32 位整数运算？
在准备一些演示文稿时，我突然想到，我不知道 Haswell 内核一次可以执行的整数运算数量的理论限制是多少。我曾经天真地假设“Intel 内核具有 HT，但这可能会并行化不同类型的工作，因此内核可能
windows - 我怎么知道 CPU 是不是 Haswell
要知道，haswell是英特尔作为Ivy Bridge微架构的“第四代核心”继承者而开发的一种处理器微架构的代号。 1英特尔正式发布了基于这种微架构的CPU... More 但是，我想知道如何通过在
assembly - Haswell/Skylake 上的部分寄存器究竟如何执行？写AL好像对RAX有假依赖，AH不一致
这个循环在 Intel Conroe/Merom 上每 3 个周期运行一次迭代，如预期的那样在 imul 吞吐量上出现瓶颈。但是在 Haswell/Skylake 上，它每 11 个周期运行一次迭代，
memory - 新的 Haswell AVX "gather"指令有哪些对齐限制？
我正在查看AVX programming reference 。 new Haswell instructions包括一些期待已久的“聚集”负载。但是，我无法弄清楚索引数据项的对齐限制是什么。引用文献
c++ - 在非 Haswell 处理器上禁用 AVX2 功能
我编写了一些在 Haswell i7 处理器上运行的 AVX2 代码。相同的代码库也用于非 Haswell 处理器，其中相同的代码应替换为它们的 SSE 等效项。我想知道编译器是否有办法忽略非 Has
performance - Haswell AVX/FMA 延迟测试比英特尔指南慢 1 个周期
在英特尔内部函数指南中，vmulpd和vfmadd213pd延迟为 5，vaddpd延迟为 3。我编写了一些测试代码，但所有结果都慢了 1 个周期。这是我的测试代码: .CODE test_lat
performance - Haswell AVX/FMA 延迟测试比英特尔指南慢 1 个周期
在英特尔内部函数指南中，vmulpd和vfmadd213pd延迟为 5，vaddpd延迟为 3。我编写了一些测试代码，但所有结果都慢了 1 个周期。这是我的测试代码: .CODE test_lat
c++ - 在 haswell RTM 中使用的 isLocked 方法
我目前正在使用 Intel Haswell RTM(事务内存的硬件支持)开发应用程序。据我所知here和 here ，建议的过程是使用某种回退锁，以防事务中止。推荐流程如下: someTypeOfL
C 程序 77% 的时间花在 _platform_memmove$VARIANT$Haswell
关闭。这个问题需要debugging details .它目前不接受答案。编辑问题以包含 desired behavior, a specific problem or error, and th
c++ - 使用 haswell tsx 的神秘 rtm 中止
我正在 haswell 中试验 tsx 扩展，通过调整现有的中型(1000 行)代码库以使用 GCC 事务内存扩展(在 native 中间接使用 haswell tsx)而不是粗粒度锁。我正在使用 G
c++ - AVX2 在 Haswell 上比 SSE 慢
我有以下代码(正常、SSE 和 AVX): int testSSE(const aligned_vector & ghs, const aligned_vector & lhs) { int
c - 除非 Haswell 指定，否则 GCC 编译前导零计数很差
GCC 支持 __builtin_clz(int x) 内置函数，它计算参数中前导零(连续的最高有效零)的数量。除其他外0，这对于有效实现 lg(unsigned int x) 非常有用函数，取 x
c++ - 使用 Ivy Bridge 和 Haswell 循环展开以实现最大吞吐量
我正在使用 AVX 一次计算八个点积。在我当前的代码中，我做了这样的事情(在展开之前): Ivy 桥/沙桥 __m256 areg0 = _mm256_set1_ps(a[m]); for(int i
cpu - 沙桥和 haswell SSE2/AVX/AVX2 的每个周期的 FLOPS
我对使用 Sandy-Bridge 和 Haswell 可以完成每个内核每个周期的触发器感到困惑。根据我对 SSE 的理解，SSE 的每个内核每个周期应该是 4 个触发器，AVX/AVX2 的每个内
performance - SSE2 8x8 字节矩阵转置代码在 Haswell+ 上比在 Ivy 桥上慢两倍
我编写了很多 punpckl、pextrd 和 pinsrd 的代码，它们旋转 8x8 字节矩阵，作为使用循环平铺旋转 B/W 图像的更大例程的一部分。我使用 IACA 对其进行了分析，以查看是否值

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c - 在 L1 缓存 : only getting 62% 中获取 Haswell 上的峰值带宽