multithreading - Does lock xchg have the same behavior as mfence?

Repost. Author: 行者123. Last updated: 2023-12-03 12:43:37
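What I'm wondering is whether lock xchg has behavior similar to mfence from the perspective of one thread accessing a memory location that is being mutated (let's say at random) by other threads. Does it guarantee that I get the most up-to-date value? And does that also hold for the memory read/write instructions that follow it?

The reason for my confusion is: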

8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”

-Intel 64 Developers Manual Vol. 3



Does this apply across threads?

The documentation for mfence states:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).

-Intel 64 Developers Manual Vol 3A



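This sounds like a stronger guarantee. It sounds as if mfence is almost flushing the write buffer, or at least reaching out to the write buffer and the other cores to ensure that my future loads and stores are up to date.

When benchmarked, both instructions take on the order of 100 cycles to complete, so I can't see a big difference either way.

Primarily I am just confused. I see instructions based on lock used in mutexes, but those contain no memory fences. Then I see lock-free programming that uses memory fences but no locks. I understand that AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't have the same behavior as mfence, then how do mutexes help you see the most recent value?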

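Best answer

I believe your question is the same as asking whether mfence has the same barrier semantics as the lock-prefixed instructions on x86, or whether it provides fewer [1] or additional guarantees in some cases.

My current best answer is that it was Intel's intent, and that the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.

We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory: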

MOVNTDQA From WC Memory May Pass Earlier Locked Instructions

Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an earlier locked instruction that accesses a different cache line.

Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.

Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA should insert an MFENCE instruction between the locked instruction and subsequent (V)MOVNTDQA instruction.



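From this, we can determine that (1) Intel probably intended locked instructions to fence NT loads from WC-type memory, or else this wouldn't be an erratum [0.5], and (2) locked instructions don't actually do that, and Intel either wasn't able to or chose not to fix this with a microcode update, so mfence is recommended instead.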
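The workaround from the erratum can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not Intel's code: the names locked_op_then_streaming_load, Block and self_check are hypothetical, the atomic exchange stands in for any locked instruction, and the buffer is ordinary write-back memory, where MOVNTDQA behaves like a normal load, so the erratum itself cannot be reproduced this way. The intrinsic path is compiled only when SSE4.1 is enabled (for example with -msse4.1); otherwise a portable fallback is used.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // _mm_stream_load_si128 (MOVNTDQA); also pulls in _mm_mfence
#endif

// Hypothetical 16-byte block; MOVNTDQA requires 16-byte alignment.
struct alignas(16) Block { std::uint32_t w[4]; };

// Performs a locked operation, then an explicit full fence, then a streaming load.
// In the erratum's scenario `src` would point into WC memory (an assumption here;
// user-space code normally only sees WB memory).
inline Block locked_op_then_streaming_load(std::atomic<int>& lock_word, Block* src) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    // A seq_cst exchange is typically compiled to xchg with a memory operand on x86.
    lock_word.exchange(1, std::memory_order_seq_cst);

#if defined(__SSE4_1__)
    // Per the erratum text, do not rely on the locked instruction alone:
    // place MFENCE between it and the subsequent (V)MOVNTDQA.
    _mm_mfence();
    __m128i v = _mm_stream_load_si128(reinterpret_cast<__m128i*>(src));
    Block out;
    _mm_store_si128(reinterpret_cast<__m128i*>(&out), v);
    return out;
#else
    // Portable fallback: a full fence followed by an ordinary load.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return *src;
#endif
}

// Small sanity check: the data is returned unchanged and the lock word was written.
inline bool self_check() {
    std::atomic<int> lock_word{0};
    Block src{{1u, 2u, 3u, 4u}};
    Block out = locked_op_then_streaming_load(lock_word, &src);
    return lock_word.load() == 1 && std::memcmp(&out, &src, sizeof(Block)) == 0;
}
```

The only point of the sketch is the placement of the fence: it sits between the locked operation and the streaming load, which is what the erratum's workaround text asks for.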

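On Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. This has pretty much the same text as the lock-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".

This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At that point Skylake was almost certainly already out in the wild, apparently with a less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.

It doesn't really explain, however, why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.

One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instruction for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they will never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.

Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist on Broadwell or on microarchitectures after Skylake.

Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.

The remaining part of this answer is my original answer, written before I knew about the errata, and is left mostly for historical interest.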

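Old Answer

My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).

That said, this is just an informed guess, and you'll find the details of my investigation below.

Details

Documentation

It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).

I think it is safe to say that solely with respect to write-back memory regions, and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.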
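For ordinary write-back memory, this equivalence can be illustrated with the classic store-buffer litmus test. The following is a minimal sketch in portable C++, not something from the Intel manuals; the names Barrier and count_both_zero are hypothetical. Each of two threads stores 1 to its own variable and then loads the other's. One variant uses a sequentially consistent exchange, which x86 compilers typically emit as xchg with a memory operand; the other uses a plain store followed by a sequentially consistent fence, which compilers typically emit as mfence or as a dummy locked instruction, depending on the compiler and version. In both variants the C++ memory model forbids both threads reading 0.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

enum class Barrier { LockedExchange, StoreThenFence };

// Runs the store-buffer litmus test `rounds` times and returns how many rounds
// ended with both threads reading 0 (expected to be none with either barrier).
inline int count_both_zero(Barrier kind, int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        auto body = [kind](std::atomic<int>& mine, std::atomic<int>& other, int& out) {
            if (kind == Barrier::LockedExchange) {
                // Locked read-modify-write acting as the barrier.
                mine.exchange(1, std::memory_order_seq_cst);
                out = other.load(std::memory_order_seq_cst);
            } else {
                // Plain store, then an explicit full fence before the load.
                mine.store(1, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_seq_cst);
                out = other.load(std::memory_order_relaxed);
            }
        };

        std::thread t1(body, std::ref(x), std::ref(y), std::ref(r1));
        std::thread t2(body, std::ref(y), std::ref(x), std::ref(r2));
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Replacing the barrier with nothing (a relaxed store followed by a relaxed load) would allow the both-zero outcome, because the store can still be sitting in the store buffer when the load executes. Either barrier closes that window on WB memory.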

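What is open for debate is whether mfence differs at all from lock-prefixed instructions in scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.

For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.

For example, quoting Dr. McCalpin in this thread (emphasis added):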

The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in a parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the Write-Combining buffers, but pass through the barrier and allow other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that is migrated by the OS from one core to another core (not sure about this case).

I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the SWDM, the MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!



Let's check out section 8.2.5 of the Intel SDM:

Strengthening or Weakening the Memory-Ordering Model

The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory-ordering model to handle special programming situations. These mechanisms include:

• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization capabilities for specific types of memory operations.

These mechanisms can be used as follows:

Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used to (the IN and OUT instructions) impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:

• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.

• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.



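Contrary to Dr. McCalpin's interpretation [2], I see this section as somewhat ambiguous as to whether mfence does something extra. The three paragraphs referring to I/O instructions, locked instructions and serializing instructions do imply that they provide a full barrier between the memory operations before and after the operation. They make no exception for weakly-ordered memory, and in the case of the I/O instructions one would also assume that they need to work consistently with weakly-ordered memory regions, since such regions are often used for I/O.

Then, the paragraph on the FENCE instructions explicitly mentions weak ordering: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."

Do we read between the lines and take this to mean that these are the only instructions that accomplish this, and that the previously mentioned techniques (including locked instructions) don't help for weakly-ordered memory regions? We can find some support for this idea by noting that the fence instructions were introduced at the same time [3] as the weakly-ordered non-temporal store instructions, and in text like that found in 11.6.13 Cacheability Hint Instructions, which deals specifically with weakly-ordered instructions: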

The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.



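Again, the fence instructions are specifically mentioned here as being appropriate for fencing weakly-ordered instructions.

We also find support for the idea that locked instructions might not provide a barrier between weakly-ordered accesses in the last sentence of the section already quoted above: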

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.



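This basically implies that the FENCE instructions essentially replace functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested approach, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly-ordered ones) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.

However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions were not intended to provide additional barriers, but simply more efficient ones.

Indeed, sfence, at a few cycles, is much faster than serializing instructions like cpuid or lock-prefixed instructions, which generally take 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions [4], at least on modern hardware. Still, it could have been faster when it was introduced, or might be on some future design, or perhaps it was expected to be faster and that didn't pan out.

So I can't make a definite assessment based on these sections of the manuals: I think you can make a reasonable argument for interpreting them either way.

We can further look at the documentation for the various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote: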

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.



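The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints" instead. Indeed, the actual memory type (e.g., as defined by the MTRRs) probably doesn't even come into play here: the ordering issues can arise purely within WB memory when weakly-ordered instructions are used.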
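The scenario discussed in this section, non-temporal stores to ordinary WB memory followed by an "ordinary" store that other threads wait on, can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: the names g_payload, g_ready, produce, consume and run_once are hypothetical, and on non-x86 targets the sketch falls back to plain stores. It uses sfence, which the movnti documentation quoted above lists alongside mfence; Dr. McCalpin's post recommends mfence, which would also work here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI); also provides _mm_sfence
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

constexpr std::size_t kWords = 1024;
alignas(64) int g_payload[kWords];   // ordinary write-back memory
std::atomic<bool> g_ready{false};    // the "ordinary" store that publishes the data

// Fills the payload with weakly-ordered stores, fences, then sets the flag.
inline void produce(int seed) {
    for (std::size_t i = 0; i < kWords; ++i) {
#if HAVE_NT_STORES
        _mm_stream_si32(&g_payload[i], seed + static_cast<int>(i));
#else
        g_payload[i] = seed + static_cast<int>(i);
#endif
    }
#if HAVE_NT_STORES
    // Make the non-temporal stores globally visible before the flag store.
    _mm_sfence();
#endif
    g_ready.store(true, std::memory_order_release);
}

// Waits for the flag, then sums the payload.
inline long consume() {
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    assert(g_ready.load(std::memory_order_relaxed));
    long sum = 0;
    for (std::size_t i = 0; i < kWords; ++i) sum += g_payload[i];
    return sum;
}

// Runs one producer thread against a consumer on the calling thread.
inline long run_once(int seed) {
    g_ready.store(false, std::memory_order_relaxed);
    std::thread producer(produce, seed);
    long sum = consume();
    producer.join();
    return sum;
}
```

Without the fence, the x86 release store to g_ready is an ordinary store and is not ordered after the earlier non-temporal stores, so a consumer could in principle observe the flag before the payload.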

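Performance

Based on Agner Fog's instruction timings, the mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs, whereas a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.

If mfence provided barrier semantics no stronger than lock cmpxchg, the latter would be doing strictly more work, and there would be no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even the infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the lock instructions, mfence would simply use the same one, as that is the simplest and easiest option to validate.

So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.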

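[0.5] This isn't a watertight argument. Some things that appear in errata are apparently "by design" rather than bugs, such as the popcnt false dependency on the destination register, so some errata can be considered a form of documentation that updates expectations rather than always implying a hardware bug.

[1] Evidently, a lock-prefixed instruction also performs an atomic operation, which isn't possible to achieve with mfence alone, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios or to perform better.

[2] It is also entirely possible that he was reading a different version of the manual, where the prose was a little different.

[3] SFENCE in SSE; lfence and mfence in SSE2.

[4] And often it is slower: Agner lists it at 33 cycles of latency on recent hardware, while locked instructions are usually about 20 cycles.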

Regarding "multithreading - Does lock xchg have the same behavior as mfence?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40409297/
