multithreading - Does lock xchg have the same behavior as mfence?

Reposted by: 行者123 · Updated: 2023-12-03 12:43:37

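What I'm wondering is whether lock xchg behaves similarly to mfence from the perspective of one thread accessing a memory location that is being mutated (let's say at random) by other threads. Does it guarantee that I get the most up-to-date value? And does that also hold for the memory read/write instructions that follow it?

The reason for my confusion is: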

8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”

-Intel 64 Developers Manual Vol. 3

Does this apply across threads?
The mfence documentation states:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).

-Intel 64 Developers Manual Vol 3A

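This sounds like a stronger guarantee. It sounds as if mfence almost flushes the write buffer, or at least reaches out to the write buffer and the other cores to ensure that my subsequent loads and stores are up to date.

When benchmarked, both instructions take roughly 100 cycles to complete, so I can't see a big difference either way.

Primarily, I am just confused. I see lock-based instructions used in mutexes, but those contain no memory fences. Then I see lock-free programming that uses memory fences but no locks. I understand that AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't behave the same way as mfence, how do mutexes help you see the most recent value?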
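For reference, here is a minimal sketch of the kind of lock-based mutex the question describes: a test-and-set spinlock with no explicit fence instruction. The names `SpinLock` and `count_with_spinlock` are hypothetical and chosen only for illustration. On x86, compilers typically lower `exchange` to `xchg` with a memory operand (implicitly locked) and the release store to a plain `mov`, but that lowering is an assumption about the toolchain rather than something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock; illustrative only.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        // exchange() is an atomic read-modify-write. It returns the previous
        // value, so a result of false means this thread took the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait on plain loads so the cache line is not hammered with RMWs.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        // Release store: writes made inside the critical section become
        // visible to the next thread that acquires the lock.
        locked_.store(false, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads under the lock
// and returns the final value.
long count_with_spinlock(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

Calling `count_with_spinlock(4, 10000)` should return 40000: every increment happens inside the critical section, so no update is lost even though the counter itself is an ordinary variable.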

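Best answer

I believe your question is the same as asking whether mfence has the same barrier semantics as the lock-prefixed instructions on x86, or whether it provides fewer [1] or additional guarantees in some cases.

My current best answer is that it was Intel's intent, and that the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.

We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions do not fence a subsequent non-temporal read from WC memory: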

MOVNTDQA From WC Memory May Pass Earlier Locked Instructions

Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an earlier locked instruction that accesses a different cache line.

Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.

Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA should insert an MFENCE instruction between the locked instruction and subsequent (V)MOVNTDQA instruction.

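From this, we can determine that (1) Intel probably intended locked instructions to fence NT loads from WC-type memory, or else this would not be an erratum [0.5], and (2) locked instructions do not actually do that, and Intel was unable or chose not to fix this with a microcode update, so mfence is recommended instead.

In Skylake, per SKL079, mfence actually lost this additional fencing capability with respect to NT loads: "MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions". The text is nearly identical to the lock-instruction erratum, but it applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum," which is generally Intel-speak for "a microcode update addresses this."

This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after that processor was released, so we can assume the issue came to Intel's attention some time before that. By then Skylake was almost certainly already out in the wild, apparently with a less conservative mfence implementation that also did not fence NT loads on WC-type memory regions. Given how widely locked instructions are used, fixing the way they work all the way back to Haswell was probably either impossible or expensive, but some way to fence NT loads was still needed. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.

This does not really explain why SKL079 (the mfence one) appeared in January 2016, almost two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.

One might speculate on what Intel will do in the future. Since they were unable or unwilling to change the lock instructions for Haswell through Skylake, which represent hundreds of millions (if not billions) of deployed chips, they will never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so that they do fence such reads, but as a practical matter you probably cannot rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.

Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. Earlier microarchitectures may also suffer from this issue. This "bug" does not seem to exist in Broadwell or in microarchitectures after Skylake.

Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.

The rest of this answer is my original answer, written before I knew about the errata, and is kept mostly for historical interest.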

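Old Answer

My informed guess at the answer is that mfence provides additional barrier functionality: between accesses that use weakly-ordered instructions (e.g., NT stores), and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).

That said, this is just an informed guess, and you'll find the details of my investigation below.

Details

Documentation

It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).

I think it is safe to say that, solely with respect to write-back memory regions and with no non-temporal accesses involved, mfence provides the same ordering semantics as a lock-prefixed operation.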
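The write-back case can be illustrated with the classic store-buffer litmus test. The sketch below is not from the original answer: the names `Barrier`, `store_then_load` and `count_forbidden_outcomes` are hypothetical, and it is written with portable C++ atomics rather than raw instructions. On x86, a seq_cst `exchange` is typically compiled to `xchg`, and a seq_cst `atomic_thread_fence` to either `mfence` or a dummy locked instruction, depending on the compiler. Each thread stores 1 to its own variable, applies one of the two barriers, then loads the other thread's variable. With either barrier, the outcome where both threads read 0 is forbidden.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Which full barrier separates the store from the following load.
enum class Barrier { LockedRmw, FullFence };

// Store 1 to `mine`, apply the chosen barrier, then load `other`.
int store_then_load(Barrier kind, std::atomic<int>& mine, std::atomic<int>& other) {
    if (kind == Barrier::LockedRmw) {
        // The store is done by a sequentially consistent read-modify-write.
        mine.exchange(1, std::memory_order_seq_cst);
    } else {
        // Plain store followed by a stand-alone full fence.
        mine.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
    // The load is seq_cst so the guarantee also holds in the C++ memory
    // model; on x86 this is still an ordinary load instruction.
    return other.load(std::memory_order_seq_cst);
}

// Runs the litmus test `rounds` times and returns how often both threads
// read 0, which would mean a store was reordered after the later load.
int count_forbidden_outcomes(Barrier kind, int rounds) {
    assert(rounds >= 0);
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;
        std::thread a([&] { r1 = store_then_load(kind, x, y); });
        std::thread b([&] { r2 = store_then_load(kind, y, x); });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

Both `count_forbidden_outcomes(Barrier::LockedRmw, n)` and `count_forbidden_outcomes(Barrier::FullFence, n)` should return 0 for any `n`. Removing the barrier (a relaxed store followed by a relaxed load) makes the both-zero outcome legal, and a version that reuses long-running threads rather than spawning two per round is much more likely to actually observe it.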

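What is open for debate is whether mfence differs at all from lock-prefixed instructions in scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.

For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.

For example, quoting Dr. McCalpin in this thread (emphasis added):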

The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in a parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the Write-Combining buffers, but pass through the barrier and allow other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that is migrated by the OS from one core to another core (not sure about this case).

I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the SWDM, the MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!

Let's check the referenced section 8.2.5 of the Intel SDM:

Strengthening or Weakening the Memory-Ordering Model

The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory-ordering model to handle special programming situations. These mechanisms include:

• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization capabilities for specific types of memory operations.

These mechanisms can be used as follows:

Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used to (the IN and OUT instructions) impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:

• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.

• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

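Contrary to Dr. McCalpin's interpretation [2], I see this section as somewhat ambiguous about whether mfence does something extra. The three paragraphs on I/O, locked instructions and serializing instructions do imply that they provide a full barrier between the memory operations before and after the operation. They make no exception for weakly-ordered memory, and in the case of the I/O instructions one would also assume they need to work consistently with weakly-ordered memory regions, since such regions are often used for I/O.

Then there is the paragraph on the FENCE instructions, which explicitly mentions weak memory regions: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."

Do we read between the lines and take this to mean that these are the only instructions that accomplish this, and that the previously mentioned techniques (including locked instructions) do not help for weak memory regions? We can find some support for this idea by noting that the fence instructions were introduced [3] at the same time as the weakly-ordered non-temporal store instructions, and in text like that found in 11.6.13 Cacheability Hint Instructions, which deals specifically with weakly-ordered instructions: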

The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.

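Again, the fence instructions are specifically mentioned here as appropriate for fencing weakly-ordered instructions.

We also find support for the idea that locked instructions might not provide a barrier between weakly-ordered accesses in the last sentence already quoted above: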

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

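This basically implies that, in terms of memory ordering, the FENCE instructions essentially replace the functionality previously offered by the serializing cpuid. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested approach, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly-ordered ones) that lock-prefixed instructions did not handle, where cpuid was being used and where mfence is now suggested as a replacement, which implies stronger barrier semantics than lock-prefixed instructions.

However, we could interpret some of the above differently: note that the fence instructions are often described as a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient ones.

Indeed, sfence, at a few cycles, is much faster than serializing instructions like cpuid or lock-prefixed instructions, which generally take 20 cycles or more. On the other hand, mfence is generally not faster than locked instructions [4], at least on modern hardware. Still, it could have been faster when it was introduced, or it might be on some future design, or perhaps it was expected to be faster and that did not pan out.

So I can't make a definite assessment based on these sections of the manual: I think a reasonable argument can be made for either interpretation.

We can look further at the documentation for the various non-temporal store instructions in the Intel ISA guide. For example, the documentation for the non-temporal store movnti contains the following quote: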

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.

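The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would rather expect it to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints." Indeed, the actual memory type (e.g., as defined by the MTRRs) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when weakly-ordered instructions are used.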
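To make the WB-memory case concrete, the following is a minimal sketch (not from the original answer) of a producer that fills an ordinary write-back buffer with non-temporal stores and then publishes it through a flag. The function name `publish_with_nt_stores` and the macro `HAVE_NT_STORES` are hypothetical. The intrinsics `_mm_stream_si32` (which maps to movnti) and `_mm_sfence` are the SSE2/SSE ones from `<emmintrin.h>`. On non-x86-64 targets the sketch falls back to ordinary stores so that it still compiles. The key point is the `_mm_sfence()` before the flag store: a C++ release store on x86 is typically a plain `mov`, which on its own does not order earlier NT stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (SSE2); also provides _mm_sfence
#define HAVE_NT_STORES 1  // hypothetical macro, used only in this sketch
#endif

// Fills a buffer with 1..n using non-temporal stores where available,
// publishes it via a flag, and returns the sum seen by a consumer thread.
long publish_with_nt_stores(int n) {
    assert(n > 0);
    std::vector<int> buffer(n, 0);
    std::atomic<bool> ready{false};
    long sum = 0;

    std::thread consumer([&] {
        // Wait until the producer signals that the buffer is complete.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (int v : buffer) {
            sum += v;
        }
    });

    for (int i = 0; i < n; ++i) {
#ifdef HAVE_NT_STORES
        _mm_stream_si32(&buffer[i], i + 1);  // weakly-ordered NT store
#else
        buffer[i] = i + 1;  // fallback: ordinary store
#endif
    }

#ifdef HAVE_NT_STORES
    // Make all earlier NT stores globally visible before the flag store.
    _mm_sfence();
#endif
    ready.store(true, std::memory_order_release);

    consumer.join();
    return sum;
}
```

With the fence in place, `publish_with_nt_stores(1000)` should return 500500. Whether a locked instruction could stand in for `_mm_sfence()` here is exactly the kind of question the surrounding discussion leaves open, so this sketch sticks to the fence that the movnti documentation names.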

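Performance

According to Agner Fog's instruction timings, mfence is reported to take 33 cycles (back-to-back latency) on modern CPUs, whereas a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.

If mfence provided barrier semantics no stronger than lock cmpxchg, the latter would be doing strictly more work, and there would be no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. That argument is weakened by the fact that all of the locked instructions, even infrequently used ones, are considerably faster than mfence. Also, you would imagine that if a single barrier implementation were shared by all the lock instructions, mfence would simply use the same one, since that is the simplest and easiest approach to validate.

So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.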

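[0.5] This isn't a watertight argument. Some things that appear in errata are apparently "by design" rather than bugs, such as the popcnt false dependency on the destination register, so some errata can be considered a form of documentation that updates expectations rather than always implying a hardware bug.

[1] Evidently, lock-prefixed instructions also perform an atomic operation, which is not possible to achieve with mfence alone, so lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios or to perform better.

[2] It is also entirely possible that he was reading a different version of the manual in which the prose was different.

[3] sfence in SSE; lfence and mfence in SSE2.

[4] And it is often slower: Agner lists it at 33 cycles of latency on recent hardware, while locked instructions are usually around 20 cycles.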

Regarding "multithreading - Does lock xchg have the same behavior as mfence?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40409297/


