
assembly - Intel Broadwell uop fusion for AVX load/store instructions

Reposted. Author: 行者123. Updated: 2023-12-01 11:27:53

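I am trying to establish a performance baseline for a memory-bound vectorized loop. I am doing this on an Intel Broadwell chip with AVX2 instructions, with all data 32-byte aligned.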

The baseline loop uses 8 YMM registers per iteration to load from one location and do non-temporal stores to another:

%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC

align 32 ;; avx2 vector alignment

global _ls_01_opt

section .text

_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp

xor rax,rax

mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;

LOOP0:
vmovapd ymm0, ymmword ptr [ (32) + rdi +8*rax]
vmovapd ymm2, ymmword ptr [ (64) + rdi +8*rax]
vmovapd ymm4, ymmword ptr [ (96) + rdi +8*rax]
vmovapd ymm6, ymmword ptr [ (128) + rdi +8*rax]

vmovapd ymm8, ymmword ptr [ (160) + rdi +8*rax]
vmovapd ymm10, ymmword ptr [ (192) + rdi +8*rax]
vmovapd ymm12, ymmword ptr [ (224) + rdi +8*rax]
vmovapd ymm14, ymmword ptr [ (256) + rdi +8*rax]

vmovntpd ymmword ptr [ (32) + rsi +8*rax], ymm0
vmovntpd ymmword ptr [ (64) + rsi +8*rax], ymm2
vmovntpd ymmword ptr [ (96) + rsi +8*rax], ymm4
vmovntpd ymmword ptr [ (128) + rsi +8*rax], ymm6

vmovntpd ymmword ptr [ (160) + rsi +8*rax], ymm8
vmovntpd ymmword ptr [ (192) + rsi +8*rax], ymm10
vmovntpd ymmword ptr [ (224) + rsi +8*rax], ymm12
vmovntpd ymmword ptr [ (256) + rsi +8*rax], ymm14

add rax, (4*8)
cmp rax, SIZE
jne LOOP0


mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;

pop rbp ; restore the frame pointer pushed on entry, so ret sees the return address
ret

I assembled it with YASM and then ran it through the Intel Architecture Code Analyzer (IACA), which tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 8.0 4.0 | 8.0 4.0 | 8.0 | 0.5 | 0.5 | 0.0 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
| 1 | | 0.5 | | | | 0.5 | | | | add rax, 0x20
| 1 | 0.5 | | | | | | 0.5 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff78

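I was under the impression that on Broadwell I could get 2x the loads at a time, loading on ports 2 and 3 simultaneously. Why is that not happening?

Thanks

Update

Following the suggestions below, I replaced pd with ps and moved each address into a single register. The new code looks like this: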
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC

align 32 ;; avx2 vector alignment

global _ls_01_opt

section .text

_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp

xor rax,rax
xor rbx,rbx
xor rcx,rcx

or rbx, rdi
or rcx, rsi


mov ebx, 111 ; IACA PREFIX (overwrites the pointer in rbx: remove the markers before actually running this)
db 0x64, 0x67, 0x90 ;

LOOP0:
vmovaps ymm0, ymmword ptr [ (32) + rbx ]
vmovaps ymm2, ymmword ptr [ (64) + rbx ]
vmovaps ymm4, ymmword ptr [ (96) + rbx ]
vmovaps ymm6, ymmword ptr [ (128) + rbx ]

vmovaps ymm8, ymmword ptr [ (160) + rbx ]
vmovaps ymm10, ymmword ptr [ (192) + rbx ]
vmovaps ymm12, ymmword ptr [ (224) + rbx ]
vmovaps ymm14, ymmword ptr [ (256) + rbx ]

vmovntps ymmword ptr [ (32) + rcx], ymm0
vmovntps ymmword ptr [ (64) + rcx], ymm2
vmovntps ymmword ptr [ (96) + rcx], ymm4
vmovntps ymmword ptr [ (128) + rcx], ymm6

vmovntps ymmword ptr [ (160) + rcx], ymm8
vmovntps ymmword ptr [ (192) + rcx], ymm10
vmovntps ymmword ptr [ (224) + rcx], ymm12
vmovntps ymmword ptr [ (256) + rcx], ymm14

add rax, (4*8)
add rbx, (4*8*8)
add rcx, (4*8*8)
cmp rax, SIZE
jne LOOP0


mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;

pop rbp ; restore the frame pointer pushed on entry, so ret sees the return address
ret

IACA then tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port4

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 5.3 4.0 | 5.3 4.0 | 8.0 | 1.0 | 1.0 | 5.3 |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm0, ymmword ptr [rbx+0x20]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm2, ymmword ptr [rbx+0x40]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm4, ymmword ptr [rbx+0x60]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm6, ymmword ptr [rbx+0x80]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm8, ymmword ptr [rbx+0xa0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm10, ymmword ptr [rbx+0xc0]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm12, ymmword ptr [rbx+0xe0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm14, ymmword ptr [rbx+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | 1.0 | | | | | | | | add rbx, 0x100
| 1 | | | | | | 1.0 | | | | add rcx, 0x100
| 1 | | | | | | | 1.0 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff7a

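This tells me that the stores can now use port 7 for address generation and that the store uops are micro-fused (the ^ marker). IACA reports that the "block throughput" is still 8 cycles, which I assume is because of the extra instructions needed to keep the addresses in single registers. Maybe I am doing that wrong?

I still don't understand why the load operations cannot be fused.

Best answer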

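The store AGU on port 7 can only handle "simple" effective addresses (a base register plus displacement, with no index register), so your indexed stores also need the AGUs on the load ports (2 and 3). IACA does show that your loads are not actually competing with each other; it is the stores they are competing with.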
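The port numbers in both IACA reports can be reproduced with a little arithmetic. The sketch below is a simplified model, not IACA: the function name `port_pressure` is made up for illustration, and it assumes that load uops and store-address uops spread evenly over whichever AGU ports can accept them, with port 7 available only when the stores have no index register.

```python
# Simplified per-iteration port-pressure model (illustrative, not IACA itself).
# Assumption: AGU uops are spread evenly over the ports that can run them.

def port_pressure(n_loads, n_stores, indexed_stores):
    """Return (cycles per AGU port, cycles on port 4) for one loop iteration."""
    store_data_uops = n_stores          # one store-data uop per store, port 4 only
    agu_uops = n_loads + n_stores       # load uops plus store-address uops
    # Indexed stores can only use ports 2 and 3; simple ones may also use port 7.
    agu_ports = 2 if indexed_stores else 3
    return agu_uops / agu_ports, float(store_data_uops)

for label, indexed in (("indexed stores", True), ("simple stores", False)):
    agu, p4 = port_pressure(8, 8, indexed)
    bound = max(agu, p4)
    print(f"{label}: AGU ports {agu:.1f}, port 4 {p4:.1f}, bound {bound:.1f} cycles")
```

Under these assumptions the indexed version gives 8.0 cycles on ports 2 and 3 and the single-register version gives about 5.3 cycles on ports 2, 3 and 7, in line with the two reports. Port 4 stays at 8.0 either way, because each of the 8 stores needs one store-data uop there.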

Note that there are only about 10 fill buffers per core for MOVNT stores, so those will fill up quickly and become the bottleneck.
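A rough way to see why the fill buffers matter is Little's law: sustained bandwidth cannot exceed the bytes in flight divided by the time each cache line occupies a buffer. In the sketch below the buffer count comes from the answer, while the 80 ns occupancy time is a purely hypothetical placeholder, not a measurement; the function name is also invented for illustration.

```python
CACHE_LINE_BYTES = 64  # cache-line size on Intel x86 cores

def max_bandwidth_gb_per_s(buffers, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Little's law bound: bytes in flight / time each line holds a buffer.

    Bytes per nanosecond is numerically the same as GB/s (10^9 bytes/s).
    """
    return buffers * line_bytes / latency_ns

# 10 buffers from the answer; 80 ns is an assumed, hypothetical occupancy time.
print(max_bandwidth_gb_per_s(10, 80.0))
```

With those assumed numbers the bound is 8 GB/s per core. The real figure depends on the actual buffer occupancy time on your machine, so it should be measured rather than taken from this sketch.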

See also Micro fusion and addressing modes. If you use single-register addressing modes for your stores, they can micro-fuse and take fewer fused-domain uops.

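Also, I suppose it does not matter for VEX-encoded instructions, but the SSE pd versions take one extra byte of x86 machine code. clang tends to use movaps for loads/stores because it is shorter, even on integer vectors. Every existing CPU runs movaps and movapd identically. So I would suggest just using vmovaps/vmovntps. It will not make any difference at all, though: it is just one fewer set bit in the VEX prefix.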
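The encoding point can be checked by comparing the machine-code bytes directly. The byte strings below were written out by hand from the instruction encoding rules for a load from [rdi] into register 0, so treat them as an illustration and confirm them with your assembler's listing output; `differing_bits` is a helper made up for this sketch.

```python
# Hand-assembled encodings (an assumption to verify with an assembler listing).
ENCODINGS = {
    "movaps  xmm0, [rdi]": bytes.fromhex("0f2807"),    # no mandatory prefix
    "movapd  xmm0, [rdi]": bytes.fromhex("660f2807"),  # 0x66 prefix adds a byte
    "vmovaps ymm0, [rdi]": bytes.fromhex("c5fc2807"),  # VEX pp field = 00
    "vmovapd ymm0, [rdi]": bytes.fromhex("c5fd2807"),  # VEX pp field = 01
}

def differing_bits(a, b):
    """Count the bit positions that differ between two equal-length encodings."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

for text, code in ENCODINGS.items():
    print(f"{text:22s} {len(code)} bytes  {code.hex(' ')}")

print("VEX ps/pd differ in",
      differing_bits(ENCODINGS["vmovaps ymm0, [rdi]"],
                     ENCODINGS["vmovapd ymm0, [rdi]"]),
      "bit")
```

The legacy SSE forms differ in length by one byte, while the two VEX forms have the same length and differ in a single bit of the prefix, which is why the choice has no effect on code size in the AVX loop.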

This post is based on a similar question found on Stack Overflow about assembly - Intel Broadwell uop fusion for AVX load/store instructions: https://stackoverflow.com/questions/35819873/
