
performance - Can modern x86 implementations store-forward from more than one prior store?

Reposted. Author: 行者123. Updated: 2023-12-03 05:16:29

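If a load overlaps two earlier stores (and the load is not fully contained in the most recent store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load?

For example, consider the following sequence: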

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

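The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it have to wait until both prior stores have committed to L1?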
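To make the byte sourcing concrete, here is a small C model (not part of the original question) that replays the two stores against a byte array and records which store last wrote each byte. The names `model`, `store32` and `distinct_sources`, and the numbering of stores from 1, are assumptions made purely for illustration; the model says nothing about how real hardware tracks stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE 8

/* Toy memory: data bytes plus the number of the store that last wrote each byte.
 * writer[i] == 0 means "never written", 1 is the first store, 2 the second, ... */
struct model {
    uint8_t data[MEM_SIZE];
    int     writer[MEM_SIZE];
    int     next_store;
};

void model_init(struct model *m) {
    for (size_t i = 0; i < MEM_SIZE; i++) {
        m->data[i] = 0;
        m->writer[i] = 0;
    }
    m->next_store = 1;
}

/* Models `mov [base + off], r32` with little-endian byte order. */
void store32(struct model *m, size_t off, uint32_t value) {
    assert(off + 4 <= MEM_SIZE);
    for (size_t i = 0; i < 4; i++) {
        m->data[off + i] = (uint8_t)(value >> (8 * i));
        m->writer[off + i] = m->next_store;
    }
    m->next_store++;
}

/* Counts how many distinct writers supply the bytes of a load of `size`
 * bytes at `off`. Unwritten memory (writer 0) counts as a source too. */
int distinct_sources(const struct model *m, size_t off, size_t size) {
    assert(off + size <= MEM_SIZE);
    int count = 0;
    for (size_t i = 0; i < size; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (m->writer[off + j] == m->writer[off + i]) seen = 1;
        }
        if (!seen) count++;
    }
    return count;
}
```

Replaying `store32(&m, 0, x)` and `store32(&m, 2, x)` and then calling `distinct_sources(&m, 1, 2)` reports two sources: byte 1 comes from the first store and byte 2 from the second.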

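Note that by "store forwarding" I mean here any mechanism that can satisfy the read from stores still in the store buffer, rather than waiting for them to commit to L1, even if it is a slower path than the best-case "forward from a single store" case.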

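Best answer

No.

At least, not on Haswell, Broadwell or Skylake processors. On other Intel processors, the restrictions are either similar (Sandy Bridge, Ivy Bridge) or even tighter (Nehalem, Westmere, Pentium Pro/II/III/4). On AMD, similar limitations apply.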

From Agner Fog's excellent optimization manuals:

Haswell/Broadwell

The microarchitecture of Intel and AMD CPUs

§ 10.12 Store forwarding stalls

The processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding works in the following cases:

  • When a write of 64 bits or less is followed by a read of the same size and the same address, regardless of alignment.
  • When a write of 128 or 256 bits is followed by a read of the same size and the same address, fully aligned.
  • When a write of 64 bits or less is followed by a read of a smaller size which is fully contained in the write address range, regardless of alignment.
  • When an aligned write of any size is followed by two reads of the two halves, or four reads of the four quarters, etc. with their natural alignment within the write address range.
  • When an aligned write of 128 bits or 256 bits is followed by a read of 64 bits or less that does not cross an 8 bytes boundary.

A delay of 2 clocks occur if the memory block crosses a 64-bytes cache line boundary. This can be avoided if all data have their natural alignment.

Store forwarding fails in the following cases:

  • When a write of any size is followed by a read of a larger size
  • When a write of any size is followed by a partially overlapping read
  • When a write of 128 bits is followed by a smaller read crossing the boundary between the two 64-bit halves
  • When a write of 256 bits is followed by a 128 bit read crossing the boundary between the two 128-bit halves
  • When a write of 256 bits is followed by a read of 64 bits or less crossing any boundary between the four 64-bit quarters

A failed store forwarding takes 10 clock cycles more than a successful store forwarding. The penalty is much higher - approximately 50 clock cycles - after a write of 128 or 256 bits which is not aligned by at least 16.

(Emphasis added.)
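Read as a decision procedure for one store followed by one load, the Haswell/Broadwell rules quoted above might look like the following C sketch. It is not taken from Agner Fog's manual: the function name `haswell_can_forward` is made up here, sizes are in bytes, and unaligned 128-bit and 256-bit stores are conservatively treated as never forwarding, because the excerpt only lists aligned cases for them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Accepts the operand sizes the excerpt talks about: 1, 2, 4, 8, 16 or 32 bytes. */
bool is_valid_size(unsigned bytes) {
    return bytes >= 1 && bytes <= 32 && (bytes & (bytes - 1)) == 0;
}

/* Returns true if, under the quoted rules, a load of `lb` bytes at `la`
 * can be forwarded from a single earlier store of `sb` bytes at `sa`.
 * Returns false both for listed failure cases and for disjoint accesses. */
bool haswell_can_forward(uint64_t sa, unsigned sb, uint64_t la, unsigned lb) {
    assert(is_valid_size(sb) && is_valid_size(lb));
    uint64_t store_end = sa + sb;
    uint64_t load_end = la + lb;

    /* A larger read or a partially overlapping read fails. */
    if (la < sa || load_end > store_end) return false;

    /* Stores of 64 bits or less: same-size or contained smaller reads
     * forward regardless of alignment. */
    if (sb <= 8) return true;

    /* Assumption: wide stores must be naturally aligned to forward at all. */
    if (sa % sb != 0) return false;

    /* Same size, same address, fully aligned. */
    if (lb == sb) return true;

    /* Reads of 64 bits or less must not cross an 8-byte boundary. */
    if (lb <= 8) return (la / 8) == ((load_end - 1) / 8);

    /* 128-bit read of a 256-bit store: must be one of the aligned halves. */
    return la % lb == 0;
}
```

For the load in the question, checked against the most recent store, `haswell_can_forward(2, 4, 1, 2)` is false because the load starts before the store does, which is the "partially overlapping read" case.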

Skylake

The microarchitecture of Intel and AMD CPUs

§ 11.12 Store forwarding stalls

The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.

Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.

A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.

A write followed by a smaller read from the same address has little or no penalty.

A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.

An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.

(Emphasis added.)
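As a rough worked example of the Skylake numbers quoted above, the following C sketch adds the failure penalty to the best-case latency. It assumes that the "approximately 11 clock cycles extra" is measured relative to the best-case forwarding latency for the same operand size; the excerpt does not say so explicitly, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate extra cycles for a failed forward, per the quoted excerpt. */
enum { SKL_FAIL_EXTRA_CYCLES = 11 };

/* Best-case store-to-load latency quoted for Skylake:
 * 4 cycles for 32- or 64-bit operands, 5 cycles for other sizes. */
int skl_best_case_cycles(unsigned operand_bits) {
    assert(operand_bits == 8 || operand_bits == 16 || operand_bits == 32 ||
           operand_bits == 64 || operand_bits == 128 || operand_bits == 256);
    return (operand_bits == 32 || operand_bits == 64) ? 4 : 5;
}

/* Estimated latency when forwarding succeeds or fails.
 * Assumption: the failure penalty is simply added to the best case. */
int skl_estimated_cycles(unsigned operand_bits, bool forwarded) {
    int base = skl_best_case_cycles(operand_bits);
    return forwarded ? base : base + SKL_FAIL_EXTRA_CYCLES;
}
```

Under that assumption, a 32-bit load whose forwarding fails would take about 4 + 11 = 15 cycles, compared with 4 cycles when forwarding succeeds.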

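In general:

A point common to the microarchitectures covered in Agner Fog's document is that store forwarding is more likely to happen if the write was aligned and the reads are halves or quarters of the written value.

A test

A test with the following tight loop: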

mov [rsp-16], eax
mov [rsp-12], ebx
mov ecx, [rsp-15]

shows that the ld_blocks.store_forward PMU counter does indeed increment. This event is documented as follows:

ld_blocks.store_forward: This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:

  • preceding store conflicts with the load (incomplete overlap)

  • store forwarding is impossible due to u-arch limitations

  • preceding lock RMW operations are not forwarded

  • store has the no-forward bit set (uncacheable/page-split/masked stores)

  • all-blocking stores are used (mostly, fences and port I/O)
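The three-instruction test loop shown earlier could be reproduced from C roughly as follows. This is a sketch under several assumptions, not the answer's actual harness: it uses a buffer pointer instead of `rsp`-relative addresses, it relies on GCC/Clang inline assembly (AT&T syntax) on x86-64, and it falls back to `memcpy` elsewhere, where the compiler is free to emit different instructions. The names `straddling_load` and `run_loop` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One iteration of the test pattern: two adjacent 4-byte stores at
 * buf+0 and buf+4, then a 4-byte load at buf+1 that straddles both. */
uint32_t straddling_load(unsigned char *buf, uint32_t a, uint32_t b) {
    uint32_t out;
    assert(buf != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile(
        "movl %1, (%3)\n\t"    /* mov [buf + 0], a */
        "movl %2, 4(%3)\n\t"   /* mov [buf + 4], b */
        "movl 1(%3), %0\n\t"   /* mov out, [buf + 1] */
        : "=&r"(out)
        : "r"(a), "r"(b), "r"(buf)
        : "memory");
#else
    /* Portable fallback: same data flow, but no control over the instructions. */
    memcpy(buf, &a, 4);
    memcpy(buf + 4, &b, 4);
    memcpy(&out, buf + 1, 4);
#endif
    return out;
}

/* Repeats the pattern so that a performance counter has something to count.
 * The XOR accumulator keeps the loaded value live. */
uint32_t run_loop(unsigned long iterations) {
    unsigned char buf[16] = {0};
    uint32_t acc = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        acc ^= straddling_load(buf, (uint32_t)i, 0xAABBCCDDu);
    }
    return acc;
}
```

Calling `run_loop` from a `main` with a large iteration count and running the binary under something like `perf stat -e ld_blocks.store_forward` on a Linux machine whose Intel CPU exposes that event should, if the answer's observation holds, show the counter growing with the iteration count.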

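This indicates that store forwarding does indeed fail when a read only partially overlaps the most recent earlier store (even if the read is fully contained once even earlier stores are taken into account).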
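That conclusion can be phrased as a simple rule: find the youngest earlier store that overlaps the load, and forward only if that single store contains the whole load. The C sketch below encodes just this rule; the type and function names are invented for illustration, and it deliberately ignores the size and alignment restrictions from the manual excerpts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct access {
    uint64_t addr;   /* start address */
    unsigned size;   /* length in bytes */
};

enum fwd_result { FWD_NO_STORE, FWD_OK, FWD_BLOCKED };

/* True if the two byte ranges share at least one byte. */
bool overlaps(struct access a, struct access b) {
    return a.addr < b.addr + b.size && b.addr < a.addr + a.size;
}

/* True if `inner` lies entirely within `outer`. */
bool contains(struct access outer, struct access inner) {
    return inner.addr >= outer.addr &&
           inner.addr + inner.size <= outer.addr + outer.size;
}

/* `stores` is in program order, oldest first. Only the youngest store
 * overlapping the load is considered; older stores are never combined. */
enum fwd_result classify_load(const struct access *stores, size_t n,
                              struct access load) {
    assert(load.size > 0);
    for (size_t i = n; i-- > 0; ) {
        if (overlaps(stores[i], load)) {
            return contains(stores[i], load) ? FWD_OK : FWD_BLOCKED;
        }
    }
    return FWD_NO_STORE;   /* no pending store touches the load */
}
```

With the stores from the question, `{0, 4}` then `{2, 4}`, and the load `{1, 2}`, this returns `FWD_BLOCKED`; the same holds for the test loop's stores `{0, 4}` and `{4, 4}` with the load `{1, 4}`, even though the two stores together cover every byte of the load.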

Regarding "performance - Can modern x86 implementations store-forward from more than one prior store?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46135766/
