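What I would like to know is whether lock xchg behaves similarly to mfence from the point of view of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee that I get the most up-to-date value? And what about the memory read/write instructions that follow?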
The reason for my confusion is:
8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”
-Intel 64 Developers Manual Vol. 3
And the documentation for mfence states:
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).
-Intel 64 Developers Manual Vol 3A
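It sounds as if mfence is almost flushing the write buffer, or at least reaching out to the write buffer and the other cores, to ensure that my future loads/stores are up to date.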
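I see code built around lock instructions, but these contain no memory fences. Then I see lock-free programming that uses memory fences but no locks. I understand that AMD64 has a very strong memory model, but stale values can persist in cache. If lock does not behave the same way as mfence, then how do mutexes help you see the most recent value?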
Best answer
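I believe your question is the same as asking whether mfence has the same barrier semantics as lock-prefixed instructions on x86, or whether it provides fewer [1] or additional guarantees in some cases.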
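My current best answer is that it was Intel's intent, and that the ISA documentation guarantees, that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.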
We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which say that locked instructions do not fence a subsequent non-temporal read from WC memory:
MOVNTDQA From WC Memory May Pass Earlier Locked Instructions
Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an earlier locked instruction that accesses a different cache line.
Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.
Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA should insert an MFENCE instruction between the locked instruction and subsequent (V)MOVNTDQA instruction.
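The placement of the fence described in the workaround can be sketched with compiler intrinsics. This is only a sketch under stated assumptions: it assumes GCC or Clang on x86-64 (with a plain C++ fallback elsewhere), and the names ready, buffer, and fenced_stream_load are hypothetical. The buffer here is ordinary write-back memory, where MOVNTDQA behaves like a normal load, so the code shows where the MFENCE goes rather than reproducing the erratum.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define ERRATA_SKETCH_X86 1
#else
#define ERRATA_SKETCH_X86 0
#endif

// Hypothetical shared state. In the erratum scenario `buffer` would be WC memory;
// here it is ordinary memory, so this only illustrates instruction placement.
alignas(16) static std::uint32_t buffer[4] = {1, 2, 3, 4};
static std::atomic<int> ready{0};

#if ERRATA_SKETCH_X86
// The target attribute enables SSE4.1 (needed for MOVNTDQA) for this function only.
__attribute__((target("sse4.1")))
static void fenced_stream_load(std::uint32_t out[4]) {
    assert(reinterpret_cast<std::uintptr_t>(buffer) % 16 == 0);  // MOVNTDQA needs alignment
    ready.exchange(1);  // a locked instruction (typically compiled to xchg)
    _mm_mfence();       // the workaround: explicit MFENCE before the streaming load
    __m128i v = _mm_stream_load_si128(reinterpret_cast<__m128i*>(buffer));  // MOVNTDQA
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
#else
// Fallback for other targets: no streaming load exists, so use a plain copy.
static void fenced_stream_load(std::uint32_t out[4]) {
    ready.exchange(1);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    for (int i = 0; i < 4; ++i) out[i] = buffer[i];
}
#endif
```

Calling fenced_stream_load(out) with a four-element array copies the buffer contents after the exchange and the fence have executed.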
So the workaround that Intel recommends here is mfence.
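On Skylake, mfence actually lost this additional fencing capability with respect to NT loads, per an erratum titled "MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions". It has almost the same text as the locked-instruction erratum, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum", which is generally Intel-speak for "a microcode update addresses this".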
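Skylake apparently shipped with an mfence implementation that also did not fence NT loads on WC-type memory regions. Given how widely locked instructions are used, fixing the way they work all the way back to Haswell was probably either impossible or expensive, but some way of fencing NT loads was still needed.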
mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence works there too.
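This does not really explain why the mfence erratum appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum.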
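Intel was apparently unable or unwilling to change the lock instructions on chips that are already deployed in enormous numbers, so it will never be able to guarantee that locked instructions fence NT loads, and it might consider making this the documented, architected behavior in the future. Or it might update the locked instructions so that they do fence such reads, but as a practical matter you probably cannot rely on that for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.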
[...] what changed in mfence.
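mfence provides additional barrier functionality: between accesses that use weakly ordered instructions (e.g., NT stores), and between accesses to weakly ordered regions (e.g., WC-type memory).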
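What is not exactly clear is the degree to which the memory-consistency effects of mfence differ from those provided by lock-prefixed instructions (including xchg with a memory operand, which is implicitly locked).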
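To make the role of an implicitly locked xchg concrete, here is a minimal spinlock sketch in C++. It assumes that std::atomic<bool>::exchange is compiled to xchg with a memory operand, which is what GCC and Clang typically emit on x86 but is not guaranteed by the language. XchgSpinlock and run_locked_counter are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spinlock built on an atomic exchange.
class XchgSpinlock {
public:
    void lock() {
        // exchange returns the previous value: true means another thread holds the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while spinning
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Increments a plain (non-atomic) counter from several threads under the lock
// and returns the final value.
long run_locked_counter(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    XchgSpinlock lock;
    long counter = 0;  // deliberately not atomic; the lock protects it
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With ordinary write-back memory, every thread that acquires the lock sees the counter value written by the previous holder, so run_locked_counter(4, 10000) yields 40000.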
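Leaving weakly ordered accesses aside, mfence provides the same ordering semantics as lock-prefixed operations. The open question is whether mfence differs from lock-prefixed instructions at all.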
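For ordinary memory, the equivalence can be illustrated with the classic store-buffering litmus test. The sketch below is written against the C++ memory model rather than raw instructions: the exchange variant stands in for a lock-prefixed instruction and the std::atomic_thread_fence variant stands in for mfence, which is how x86 compilers commonly lower them, though that mapping is an assumption here. count_both_zero is a hypothetical helper name.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Runs the store-buffering litmus test `rounds` times and returns how many
// rounds ended with both threads reading 0 (the store-load reordered outcome).
//   use_fence == false: store via a seq_cst exchange (lock-prefixed style)
//   use_fence == true : relaxed store, then a seq_cst fence (mfence style)
int count_both_zero(bool use_fence, int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        auto body = [use_fence](std::atomic<int>& mine, std::atomic<int>& other, int& result) {
            if (use_fence) {
                mine.store(1, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_seq_cst);
                result = other.load(std::memory_order_relaxed);
            } else {
                mine.exchange(1, std::memory_order_seq_cst);
                result = other.load(std::memory_order_seq_cst);
            }
        };
        std::thread a(body, std::ref(x), std::ref(y), std::ref(r1));
        std::thread b(body, std::ref(y), std::ref(x), std::ref(r2));
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Both variants forbid the outcome in which each thread misses the other's store, so the count is 0 in either mode. Nothing in this sketch touches WC memory or non-temporal accesses, which is exactly where the two instructions may differ.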
Some sources suggest that mfence implies strong barrier semantics, for example:
The fence instruction is only needed to be absolutely sure that all of the non-temporal stores are visible before a subsequent "ordinary" store. The most obvious case where this matters is in a parallel code, where the "barrier" at the end of a parallel region may include an "ordinary" store. Without a fence, the processor might still have modified data in the Write-Combining buffers, but pass through the barrier and allow other processors to read "stale" copies of the write-combined data. This scenario might also apply to a single thread that is migrated by the OS from one core to another core (not sure about this case).
I can't remember the detailed reasoning (not enough coffee yet this morning), but the instruction you want to use after the non-temporal stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the SWDM, the MFENCE is the only fence instruction that prevents both subsequent loads and subsequent stores from being executed ahead of the completion of the fence. I am surprised that this is not mentioned in Section 11.3.1, which tells you how important it is to manually ensure coherence when using write-combining, but does not tell you how to do it!
Strengthening or Weakening the Memory-Ordering Model
The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory- ordering model to handle special programming situations. These mechanisms include:
• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.
• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory-ordering and serialization capabilities for specific types of memory operations.
These mechanisms can be used as follows:
Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions can be used to (the IN and OUT instructions) impose strong write ordering on such accesses as follows. Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetch and page tables walks can pass I/O instructions. Execution of subsequent instructions do not begin until the processor determines that the I/O instruction has been completed.
Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.
The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data. The functions of these instructions are as follows:
• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.
• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.
• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.
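It is still somewhat ambiguous whether mfence does anything extra. The three parts that cover I/O, locked instructions, and serializing instructions do imply that these provide a full barrier between the memory operations before and after the operation. They make no exception for weakly ordered memory, and in the case of the I/O instructions one would also assume that they need to work consistently with weakly ordered memory regions, since such regions are often used for I/O.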
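Then there is the part on the FENCE instructions, which explicitly mentions weakly ordered memory: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."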
The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.
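This basically implies that, in terms of memory ordering, the FENCE instructions replace functionality previously offered by the serializing cpuid. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested approach, since they are generally much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly ordered ones) that lock-prefixed instructions could not handle, where cpuid was being used and where mfence is now suggested as a replacement, which implies stronger barrier semantics than lock-prefixed instructions.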
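sfence, at a few cycles, is much faster than cpuid or lock-prefixed instructions (generally 20 cycles or more). On the other hand, at least on modern hardware, mfence is generally not faster than locked instructions [4]. Still, it may have been faster when it was introduced, it may be faster on some future design, or perhaps it was expected to be faster and that did not pan out.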
In the documentation for movnti, you can find the following quote:
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTI instructions if multiple processors might use different memory types to read/write the destination memory locations.
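The pattern that this quote recommends can be sketched as a producer that writes with MOVNTI and then issues SFENCE before publishing a flag. This is a minimal sketch, assuming GCC or Clang on x86-64 (other targets fall back to plain stores); payload, published, produce, consume_sum, and run_nt_handoff are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define NT_SKETCH_X86 1
#else
#define NT_SKETCH_X86 0
#endif

static int payload[4];                      // data written with non-temporal stores
static std::atomic<bool> published{false};  // ordinary flag that announces the data

static void produce(int base) {
    for (int i = 0; i < 4; ++i) {
#if NT_SKETCH_X86
        _mm_stream_si32(&payload[i], base + i);  // MOVNTI: weakly ordered store
#else
        payload[i] = base + i;                   // fallback: ordinary store
#endif
    }
#if NT_SKETCH_X86
    _mm_sfence();  // make the NT stores globally visible before the flag store
#endif
    published.store(true, std::memory_order_release);
}

static int consume_sum() {
    while (!published.load(std::memory_order_acquire)) std::this_thread::yield();
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += payload[i];
    return sum;
}

// Runs one producer and one consumer and returns the sum the consumer observed.
int run_nt_handoff(int base) {
    published.store(false);
    assert(!published.load());  // start from the unpublished state
    int sum = 0;
    std::thread consumer([&sum] { sum = consume_sum(); });
    std::thread producer(produce, base);
    producer.join();
    consumer.join();
    return sum;
}
```

Here sfence is enough because only earlier stores need to be ordered before the flag store; mfence would additionally order loads.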
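The mfence instruction takes 33 cycles (back-to-back latency) on modern CPUs, whereas a more complex locked instruction such as lock cmpxchg reportedly takes only 18 cycles.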
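A rough way to compare the two on your own machine is a back-to-back timing loop. The sketch below is an assumption-laden illustration, not a rigorous benchmark: it reports wall-clock nanoseconds per operation rather than cycles, it assumes GCC or Clang on x86-64 for the mfence intrinsic, and ns_per_op, time_fence, and time_cmpxchg are hypothetical names. Results vary by CPU, microcode, and compiler.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define BENCH_SKETCH_X86 1
#else
#define BENCH_SKETCH_X86 0
#endif

static std::atomic<long> bench_cell{0};  // target of the compare-exchange loop

// Runs `op` back to back and returns the average wall-clock time per call in ns.
template <typename Op>
double ns_per_op(Op op, long iterations) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) op();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

double time_fence(long iterations) {
    return ns_per_op([] {
#if BENCH_SKETCH_X86
        _mm_mfence();  // the instruction under test
#else
        std::atomic_thread_fence(std::memory_order_seq_cst);  // fallback
#endif
    }, iterations);
}

double time_cmpxchg(long iterations) {
    return ns_per_op([] {
        long expected = bench_cell.load(std::memory_order_relaxed);
        // Typically compiled to lock cmpxchg on x86.
        bench_cell.compare_exchange_strong(expected, expected + 1);
    }, iterations);
}
```

Comparing time_fence(1000000) with time_cmpxchg(1000000) gives a feel for the relative cost; a cycle-accurate figure would need a cycle counter and controlled frequency scaling.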
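If mfence provided barrier semantics no stronger than those of lock cmpxchg, then the latter would be doing strictly more work, and there would be no apparent reason for mfence to take longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and therefore gets more optimization. That argument is weakened by the fact that all locked instructions, even the infrequently used ones, are considerably faster than mfence. Also, you would imagine that if all the lock instructions shared a single barrier implementation, mfence would simply use the same one, since that is the simplest approach and the easiest to validate.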
So the slower performance of mfence is strong evidence that mfence is doing something extra.
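One example is the false dependency of popcnt on its destination register, so some errata can be considered a form of documentation that updates expectations, rather than always implying a hardware bug.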
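lock-prefixed instructions also perform an atomic operation, which cannot be achieved with mfence alone, so lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios or to perform better.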
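The difference between an atomic read-modify-write and a fence can be sketched as follows. The first counter uses fetch_add (typically lock xadd on x86, which is an assumption about the compiler), and the second uses a separate load and store with a full fence in between. Totals and count_both_ways are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Totals {
    long with_rmw;         // counter updated with an atomic read-modify-write
    long with_fence_only;  // counter updated with load, fence, store
};

Totals count_both_ways(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> rmw{0};
    std::atomic<long> fenced{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                rmw.fetch_add(1);  // one indivisible increment

                // A fence orders the load and the store but does not fuse them:
                // another thread can update `fenced` in between, and that
                // update is then overwritten.
                long seen = fenced.load(std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_seq_cst);
                fenced.store(seen + 1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return {rmw.load(), fenced.load()};
}
```

The read-modify-write total always equals threads times iterations, while the fence-only total can never exceed it and will usually fall short under contention.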
The fence instructions arrived with the SSE extensions: sfence in SSE, and lfence and mfence in SSE2.
Regarding "multithreading - Does lock xchg have the same behavior as mfence?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40409297/