performance - What happens after an L2 TLB miss?

Reposted. Author: 行者123. Updated: 2023-12-03 07:31:48

I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer both miss.

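I'm not sure whether the "page walk" happens in special hardware circuitry, whether the page tables are stored in the L2 / L3 cache, or whether they reside only in main memory.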

Best answer

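(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses in software.)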
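Modern x86 microarchitectures have dedicated page-walk hardware. They can even do page walks speculatively, loading TLB entries before a TLB miss actually happens. To support hardware virtualization, the page walkers can also handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMWare published a paper with a summary of EPT, and benchmarks on Nehalem.)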
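Skylake can even have two page walks in flight at once; see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but with worse throughput. This may be related, or adding a second page-walk unit may have been a separate response to discovering that page-split accesses (and TLB misses?) mattered more in real workloads than previously estimated.)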
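Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an uncached PTE is speculatively loaded but then modified by a store to the page table before the first real use of the entry. That is, they snoop for stores to the page-table entries behind speculative-only TLB entries that no earlier instruction has architecturally referenced yet.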
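(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet, so it wasn't even a bug; see Andy Glew's comments quoted below.) AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.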

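The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 cache. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. a cache miss). The event name is PAGE_WALKER_LOADS.DTLB_L1 for the number of DTLB page-walker hits in L1 + FB, and there are other events for the ITLB and for other levels of cache.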
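Since modern page tables use a radix-tree format, with page directory entries pointing to tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3).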
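To make the radix-tree layout concrete, here is a minimal Python sketch (not part of the original answer) of how a virtual address splits into table indices. It assumes ordinary x86-64 4-level paging with 4 KiB pages: four 9-bit indices plus a 12-bit page offset. The function and constant names are made up for illustration.

```python
# Sketch only: split an x86-64 virtual address into its radix-tree indices,
# assuming 4-level paging with 4 KiB pages. All names here are illustrative.

LEVELS = ("PML4", "PDPT", "PD", "PT")  # top level first
INDEX_BITS = 9      # 512 entries per table
OFFSET_BITS = 12    # 4 KiB page


def split_virtual_address(vaddr: int) -> dict:
    """Return the table index used at each level, plus the page offset."""
    parts = {"offset": vaddr & ((1 << OFFSET_BITS) - 1)}
    for depth, name in enumerate(LEVELS):
        # The top level uses the highest 9-bit field, the PT index the lowest.
        shift = OFFSET_BITS + INDEX_BITS * (len(LEVELS) - 1 - depth)
        parts[name] = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
    return parts


def pages_covered_by_pde() -> int:
    """One page-directory entry points at a page table holding 512 PTEs."""
    return 1 << INDEX_BITS
```

Every address that shares the same PML4, PDPT and PD indices goes through the same PDE, which is why a single cached PDE can shorten the walk for up to 512 different 4 KiB pages.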
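That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports.) Either way, this makes caching a few high-level PDEs (each of which covers many different translation entries) inside the page-walk hardware even more valuable, because a pointer-chasing chain costs more when each load has higher latency.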
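As a rough illustration of why caching upper-level entries pays off, the following toy Python model counts page-table loads per walk with and without a PDE cache. It is a sketch under simplifying assumptions (nested dicts stand in for page tables, the cache never evicts, and nothing is invalidated); `ToyPageWalker` and its fields are hypothetical names, not a description of any real CPU's page-walk cache.

```python
class ToyPageWalker:
    """Counts page-table loads; optionally caches PDEs (hypothetical model)."""

    def __init__(self, root, use_pde_cache=True):
        self.root = root              # nested dicts: PML4 -> PDPT -> PD -> PT -> frame
        self.use_pde_cache = use_pde_cache
        self.pde_cache = {}           # (pml4, pdpt, pd) indices -> page table dict
        self.loads = 0                # number of page-table entries read so far

    def walk(self, vaddr):
        # Four 9-bit indices, top level first (4-level paging, 4 KiB pages).
        idx = [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
        key = tuple(idx[:3])
        if self.use_pde_cache and key in self.pde_cache:
            table = self.pde_cache[key]       # skip the three upper-level loads
        else:
            table = self.root
            for i in idx[:3]:                 # dependent loads: each needs the previous
                self.loads += 1
                table = table[i]
            if self.use_pde_cache:
                self.pde_cache[key] = table
        self.loads += 1                       # the final PTE load always happens
        frame = table[idx[3]]
        return (frame << 12) | (vaddr & 0xFFF)
```

In this model, walking two neighbouring pages that share a PDE costs 4 + 1 loads with the PDE cache and 4 + 4 loads without it. The flip side is that a cached PDE can go stale when the OS edits the page tables, which is the invalidation issue mentioned above.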
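But note that x86 guarantees no negative caching of TLB entries. Changing a page from invalid to valid does not require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop, or somehow still implement the semantics guaranteed by the x86 manuals.)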
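(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (which was usually write-through, so this kept page walks coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the board, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)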
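That other answer also has some interesting links at the end, including the paper I linked at the end of the previous paragraph. It also seems to think that the OS might update the TLB itself, rather than just the page table, on a page fault (when the HW page walk doesn't find an entry), and wonders whether HW page walking can be disabled on x86. (But in fact the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction, so the HW page walk succeeds this time.) Perhaps the paper is thinking of ISAs like MIPS, where software TLB management / miss handling is possible.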
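I don't think it's actually possible to disable HW page walks on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread quoted below) that software TLB handling was faster on P5. I think he meant that it would have been faster, had it been possible. He was working at Imation (on MIPS) at the time, where, unlike x86 AFAIK, SW page walking is an option (and sometimes the only option).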
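The fault-and-retry flow described above, together with the no-negative-caching guarantee, can be sketched as a toy Python model. This is an illustration under simplifying assumptions (a flat dict as the page table, no permissions, no eviction); `ToyMMU`, `PageFault` and `load_with_os_handler` are hypothetical names invented for this sketch.

```python
class PageFault(Exception):
    def __init__(self, vpn):
        super().__init__(f"page fault on virtual page {vpn:#x}")
        self.vpn = vpn


class ToyMMU:
    """Hardware-walked TLB model that never caches invalid translations."""

    def __init__(self, page_table):
        self.page_table = page_table   # vpn -> pfn, lives in "memory"
        self.tlb = {}                  # vpn -> pfn, filled only by the walker

    def translate(self, vaddr):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        if vpn not in self.tlb:
            pfn = self.page_table.get(vpn)    # stands in for the hardware page walk
            if pfn is None:
                raise PageFault(vpn)          # nothing is cached for a failed walk
            self.tlb[vpn] = pfn
        return (self.tlb[vpn] << 12) | offset


def load_with_os_handler(mmu, vaddr, allocate_frame):
    """Run one 'load', letting a toy OS handler service any page fault."""
    while True:
        try:
            return mmu.translate(vaddr)
        except PageFault as fault:
            # The OS only edits the in-memory page table; it never writes the TLB.
            mmu.page_table[fault.vpn] = allocate_frame()
            # Looping models returning from #PF, which re-runs the faulting
            # instruction; the second walk then finds the new entry.
```

Because the failed walk left nothing behind in `tlb`, the handler does not need any invalidation step after turning the entry from invalid to valid, which matches the guarantee described above.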

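As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page walks is that a TLB miss doesn't necessarily stall the CPU. (Out-of-order execution proceeds normally until the re-order buffer fills up because the load/store can't retire. Retirement happens in order, because the CPU can't officially commit anything that shouldn't have happened if an earlier instruction faulted.)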
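BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of using a dedicated hardware state machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running).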
Microcoded TLB handling could in theory be non-terrible if you ran those uops in a separate hardware thread (interesting idea), SMT-style.
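You'd need it to have much less start/stop overhead than normal Hyperthreading has when switching from single-thread mode to both logical cores active (which has to wait for things to drain before it can partition the ROB, store queue, and so on), because it would start and stop extremely often compared to a normal logical core. But that might be possible if it isn't really a fully separate thread, just some separate retirement state, so that cache misses in it don't block retirement of the main code, and if it uses a couple of hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't need anything close to the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per page walk small, it might not suck.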
As far as I know, no CPU actually does "HW" page walks with microcode in a separate hardware thread, but it is a theoretical possibility.

Software TLB handling: some RISCs are like this, not x86
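In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. A TLB miss results in execution of the kernel's TLB-miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess that marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know the page table format.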
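This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. It describes the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). If you want a real example, the paper A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations shows some example code from what it says is the TLB miss handler in Ultrix.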
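For contrast with hardware page walks, here is a minimal Python sketch of a software-managed TLB. It is a toy model, not MIPS or Ultrix code: the "trap" is an ordinary function call, eviction is simply oldest-first, and all names (`SoftwareManagedTLB`, `os_tlb_miss_handler`, `os_page_table`, `current_asid`) are hypothetical. The point it illustrates is that the hardware only sees TLB entries, so the OS can keep its page table in whatever format it likes.

```python
class SoftwareManagedTLB:
    """Toy TLB with a fixed number of entries and no hardware page walker."""

    def __init__(self, miss_handler, capacity=4):
        self.entries = {}                 # vpn -> pfn
        self.capacity = capacity
        self.miss_handler = miss_handler  # OS-provided callable: vpn -> pfn
        self.misses = 0

    def refill(self, vpn, pfn):
        """Stands in for the handler's 'write a TLB entry' step."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict the oldest entry
        self.entries[vpn] = pfn

    def translate(self, vaddr):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        if vpn not in self.entries:
            self.misses += 1
            # "Trap" to the kernel: the hardware knows nothing about the table format.
            self.refill(vpn, self.miss_handler(vpn))
        return (self.entries[vpn] << 12) | offset


# The OS is free to choose any page-table layout. Here it is a flat dict keyed
# by (address-space id, vpn), a format no hardware walker would need to know.
os_page_table = {(7, 0x10): 0x200, (7, 0x11): 0x201}
current_asid = 7


def os_tlb_miss_handler(vpn):
    return os_page_table[(current_asid, vpn)]
```

A second access to the same page hits in `entries` and never reaches the handler, so the handler's cost is paid only on misses.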

Other links

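  • How does CPU make data request via TLBs and caches? A duplicate of this question.
  • Measuring TLB miss handling cost in x86-64 describes Westmere's perf counter for Page Walk Cycles. (Apparently new with second-generation Nehalem, i.e. Westmere.)
  • https://lwn.net/Articles/379748/ (Linux hugepage support / performance; talks a bit about PowerPC and x86, and about using oprofile to count page-walk cycles)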
  • What Every Programmer Should Know About Memory?
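  • Understanding TLB from CPUID results on Intel: an answer of mine with some background on TLBs, including why a shared L3TLB across cores wouldn't make sense. (Summary: because unlike data, page translations are thread-private. Also, more / better page-walk hardware and TLB prefetch do more to reduce the average cost of an L1i / dTLB miss in more cases.)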

  • Comments about TLB coherency from Andy Glew, one of the architects of Intel P6 (Pentium Pro / II / III), who later worked at AMD.

    The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6 page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win[1]. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.

    By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.

    so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.

    recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.


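    Footnote 1: As far as I know, no x86 CPU has ever supported software TLB management. I think Andy meant to say "would have been faster" on P5, because it couldn't be speculative or out-of-order anyway, and running x86 instructions with physical addresses (paging disabled to avoid a catch-22) would have allowed page-table loads to be cached. Andy was perhaps thinking of MIPS, which was his day job at the time.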

    More from Andy Glew from the same thread, because these comments deserve to be in a full answer somewhere.

    (2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.

    (2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".

    (2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault -but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.

    (3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.

    (4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.

    (4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that PD contains all invalid PDE; worse^3, that the PD contains valid d PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?

    (4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.

    Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, if both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.

    me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.

    Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.

    Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.

    Ditto memory ordering. High end snooping faster; low end draining cheaper.

    It is hard to maintain this dichotomy.

    It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.


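    This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions. That is another case where real CPUs go beyond the requirements on paper, because always snooping for stores near EIP / RIP is actually easier than re-syncing only on branch instructions, which would require keeping track of what happened between branches.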

    Regarding "performance - What happens after an L2 TLB miss?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32256250/
