c - Observing stale instruction fetching on x86 with self-modifying code


Reposted by: 行者123. Updated: 2023-12-02 06:46:05

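I have been told, and have read in Intel's manuals, that it is possible to write instructions to memory while the instruction prefetch queue has already fetched the stale instructions, so that the CPU executes the old instructions. I have not managed to observe this behavior. My methodology is as follows.

Section 11.6 of the Intel Software Developer's Manual states: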

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.

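So it looks like, if I want to execute stale instructions, I need two different linear addresses that refer to the same physical page. I therefore memory-map one file at two different addresses.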

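/* Needs <assert.h>, <fcntl.h>, <stdint.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>. */
/* zeros is a 0x1000-byte buffer of zero bytes (its definition is not shown). */
int fd = open("code_area", O_RDWR | O_CREAT, S_IRWXU | S_IRWXG | S_IRWXO);
assert(fd >= 0);
write(fd, zeros, 0x1000);   /* make the file exactly one page long */
/* Map the same file page twice: two linear addresses, one physical page. */
uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
assert(a1 != MAP_FAILED && a2 != MAP_FAILED);
assert(a1 != a2);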
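The double mapping is the foundation of the whole experiment, so it may be worth checking it in isolation first. The following is a minimal sketch, not the asker's code: the function name `shared_mappings_alias` and the scratch path under `/tmp` are assumptions made for illustration, and the pages are mapped without `PROT_EXEC` because only the aliasing is being tested.

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a byte written through one mapping is visible through the
 * other mapping, 0 if it is not, and -1 if the setup itself failed. */
int shared_mappings_alias(void)
{
    char path[] = "/tmp/code_area_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives until unmapped */
    if (ftruncate(fd, 0x1000) != 0) {        /* one zero-filled page */
        close(fd);
        return -1;
    }

    uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mappings stay valid */
    if (a1 == MAP_FAILED || a2 == MAP_FAILED) {
        if (a1 != MAP_FAILED) munmap(a1, 0x1000);
        if (a2 != MAP_FAILED) munmap(a2, 0x1000);
        return -1;
    }

    a1[0] = 0x90;                            /* write through the first address */
    int ok = (a1 != a2) && (a2[0] == 0x90);  /* read through the second one */

    munmap(a1, 0x1000);
    munmap(a2, 0x1000);
    return ok;
}
```

Calling `shared_mappings_alias()` from `main` and checking for a return value of 1 confirms that the two linear addresses are distinct yet backed by the same page.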

I have an assembly function that takes a single argument: a pointer to the instruction I want to change.

fun:
    push %rbp
    mov %rsp, %rbp

    xorq %rax, %rax # Return value 0

# A far jump simulated with a far return
# Push the current code segment %cs, then the address we want to far jump to

    xorq %rsi, %rsi
    mov %cs, %rsi
    pushq %rsi
    leaq copy(%rip), %r15
    pushq %r15
    lretq

copy:
# Overwrite the two nops below with `inc %eax'. We will notice the change if the
# return value is 1, not zero. The passed in pointer at %rdi points to the same physical
# memory location of fun_ins, but the linear addresses will be different.
    movw $0xc0ff, (%rdi)

fun_ins:
    nop   # Two NOPs gives enough space for the inc %eax (opcode FF C0)
    nop
    pop %rbp
    ret
fun_end:
    nop

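In C, I copy the code into the memory-mapped file. I call the function through linear address a1, but I pass a pointer into a2 as the target of the code modification.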

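/* fun, fun_ins and fun_end are the labels of the assembly routine above. */
#define DIFF(a, b) ((long)(b) - (long)(a))
long sz = DIFF(fun, fun_end);
memcpy(a1, fun, sz);                        /* copy the routine into the shared page */
void *tochange = a2 + DIFF(fun, fun_ins);   /* fun_ins as seen through the a2 mapping */
int val = ((int (*)(void*))a1)(tochange);   /* execute via a1, write via a2 */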

If the CPU picks up the modified code, val==1. Otherwise, if the stale instructions (the two nops) are executed, val==0.
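For readers who want to try the experiment without a separate assembly file, here is a self-contained sketch under stated assumptions. It is not the asker's program: the routine is reduced to hand-assembled x86-64 bytes, the far return is left out, and the names `run_alias_experiment` and `PATCH_OFFSET` as well as the scratch path are hypothetical. It returns -1 when the machine is not x86-64 or when an executable shared mapping cannot be created (for example on a `noexec` filesystem).

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hand-assembled x86-64 code, a stripped-down fun without the far return:
 *   31 c0            xor  %eax, %eax        # return value 0
 *   66 c7 07 ff c0   movw $0xc0ff, (%rdi)   # overwrite the nops with inc %eax
 *   90 90            nop; nop               # patch target at offset 7
 *   c3               ret
 */
static const uint8_t code[] = {
    0x31, 0xc0,
    0x66, 0xc7, 0x07, 0xff, 0xc0,
    0x90, 0x90,
    0xc3
};
enum { PATCH_OFFSET = 7 };

/* Returns 1 if the patched instruction ran, 0 if the stale nops ran, and
 * -1 if the experiment could not be set up on this machine. */
int run_alias_experiment(void)
{
#if defined(__x86_64__)
    char path[] = "/tmp/code_area_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 0x1000) != 0) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    uint8_t *a1 = mmap(NULL, 0x1000, prot, MAP_SHARED, fd, 0);
    uint8_t *a2 = mmap(NULL, 0x1000, prot, MAP_SHARED, fd, 0);
    close(fd);
    if (a1 == MAP_FAILED || a2 == MAP_FAILED) {
        if (a1 != MAP_FAILED) munmap(a1, 0x1000);
        if (a2 != MAP_FAILED) munmap(a2, 0x1000);
        return -1;
    }

    memcpy(a1, code, sizeof code);
    /* Execute through a1 while the store inside the routine goes through a2. */
    int (*fn)(void *) = (int (*)(void *))(uintptr_t)a1;
    int val = fn(a2 + PATCH_OFFSET);

    munmap(a1, 0x1000);
    munmap(a2, 0x1000);
    return val;
#else
    return -1;
#endif
}
```

A result of 1 means the CPU noticed the store made through the second linear address. The result depends on the processor, so the sketch makes no claim about what a given machine will return.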

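I have run this on a 1.7GHz Intel Core i5 (2011 MacBook Air) and on an Intel(R) Xeon(R) CPU X3460 @ 2.80GHz. Every time, however, I see val==1, which indicates that the CPU always notices the new instruction.

Has anyone experienced the behavior I want to observe? Is my reasoning correct? I am a little confused that the manual mentions P6 and Pentium processors but not my Core i5. Perhaps something else is causing the CPU to flush its instruction prefetch queue? Any insight would be very helpful!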

Best answer

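I think you should check the MACHINE_CLEARS.SMC performance counter of your CPU (part of the MACHINE_CLEARS event). It is available on Sandy Bridge, which your MacBook Air uses, and also on your Xeon, which is Nehalem (search for "smc"). You can read its value with oprofile, perf, or Intel's VTune: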

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.htm

Machine Clears

Metric Description

Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.

Possible Issues

A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.

SMC: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_reference/snb/events/machine_clears.html

MACHINE_CLEARS Event Code: 0xC3 SMC Mask: 0x04

Self-modifying code (SMC) detected.

Number of self-modifying-code machine clears detected.
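On Linux, one way to read this counter from inside the test program is the `perf_event_open` system call with a raw event. The sketch below assumes the Sandy Bridge encoding quoted above (event code 0xC3, mask 0x04); other microarchitectures may use a different encoding, and the helper names `raw_event` and `count_smc_clears` are hypothetical. The call can fail when the kernel restricts access to performance counters, in which case the sketch returns -1.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw x86 event encoding as used by perf: unit mask in bits 8-15,
 * event code in bits 0-7. */
uint64_t raw_event(uint8_t event_code, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event_code;
}

/* Counts SMC machine clears in user space while fn() runs on the calling
 * thread. Returns the count, or -1 if the counter is unavailable. */
long long count_smc_clears(void (*fn)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = raw_event(0xC3, 0x04);   /* MACHINE_CLEARS.SMC, assumed encoding */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid 0 = this thread, cpu -1 = any CPU, no group, no flags */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    long long result = -1;
    if (read(fd, &count, sizeof count) == (ssize_t)sizeof count)
        result = (long long)count;
    close(fd);
    return result;
}
```

The same raw event can usually be requested from the command line as `perf stat -e r04c3 ./program`, where `./program` stands for the test binary.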

Intel also says this about SMC at http://software.intel.com/en-us/forums/topic/345561 (linked from the Intel Performance Bottleneck Analyzer's taxonomy):

This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.

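I think you will see some of these events. If you do, the CPU was able to detect the act of self-modification and raised a "machine clear", a full restart of the pipeline. The first stage is fetch, which will ask the L2 cache for the new opcode. I am very interested in the exact count of SMC events per execution of your code, since that would give us some estimate of the latencies. (SMC is counted in units where 1 unit is assumed to be 1.5 CPU cycles; see B.6.2.6 of the Intel optimization manual.)

We can see that Intel says "restarted from just after the last retired instruction", so I think the last retired instruction will be the mov, while your nops are already in the pipeline. But the SMC condition is raised when the mov retires, and it kills everything in the pipeline, including the nops.

This SMC-induced pipeline restart is not cheap. Agner has some measurements in Optimizing_assembly.pdf, "17.10 Self-modifying code (All processors)" (I think any Core2/CoreiX behaves like PM here):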

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache. ...

Self-modifying code is not considered good programming practice. It should be used only if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

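Using different linear addresses to defeat the SMC detector was recommended here:
https://stackoverflow.com/a/10994728/196561 - I will try to find the actual Intel documentation... I can't really answer your real question right now.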

There may be some hints here: Optimization manual, 248966-026, April 2012, "3.6.9 Mixing Code and Data":

Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.

and in the next section:

Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.

So there is possibly some circuitry that tracks overlaps between writable and executable subpages.
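The subpage rule in the quoted guideline is plain address arithmetic, which a few lines of C can make concrete. This is only an illustration of the rule as quoted; the helper names are hypothetical, and nothing here claims to describe how the hardware performs the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if both addresses fall in the same naturally aligned block of
 * `size` bytes. `size` must be a power of two. */
bool same_aligned_block(uintptr_t a, uintptr_t b, uintptr_t size)
{
    return (a & ~(size - 1)) == (b & ~(size - 1));
}

/* Writing into the 1-KByte subpage that is currently being executed. */
bool write_hits_executing_subpage(uintptr_t write_addr, uintptr_t exec_addr)
{
    return same_aligned_block(write_addr, exec_addr, 1024);
}

/* Fetching code from the 2-KByte subpage that is currently being written. */
bool fetch_hits_written_subpage(uintptr_t fetch_addr, uintptr_t write_addr)
{
    return same_aligned_block(fetch_addr, write_addr, 2048);
}
```

For example, addresses 0x1000 and 0x1400 lie in different 1-KByte subpages but in the same 2-KByte subpage, so only the second check reports an overlap.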

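You can try to do the modification from another thread (cross-modifying code), but that needs very careful thread synchronization and pipeline flushing (you may want to brute-force some delays in the writer thread, and a CPUID right after the synchronization is needed). But you should know that THEY already fixed this using "nukes"; check patent US6857064.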

I'm a little confused about the manual mentioning P6 and Pentium processors

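This is possible if you fetched, decoded and executed a stale version of Intel's instruction manual. You can reset the pipeline and check this version: Order Number: 325462-047US, June 2013, "11.6 SELF-MODIFYING CODE". This version still says nothing about newer CPUs, but it does mention that when you modify code through a different virtual address, the behavior may not be compatible between microarchitectures (it might work on your Nehalem/Sandy Bridge and might not work on .. Skymont).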

11.6 SELF-MODIFYING CODE A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction.

Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 8.1.3, “Handling Self- and Cross-Modifying Code,” for more information about the use of self-modifying code.)

For Intel486 processors, a write to an instruction in the cache will modify it in both the cache and memory, but if the instruction was prefetched before the write, the old version of the instruction could be the one executed. To prevent the old instruction from being executed, flush the instruction prefetch unit by coding a jump instruction immediately after any write that modifies an instruction
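The manual's advice for software that patches code through a different linear address is to execute a serializing instruction before running the modified code. Below is a minimal sketch of that pattern using GCC/Clang inline assembly; the function names are hypothetical, and on non-x86 builds the serializing step compiles to a no-op that returns 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Executes CPUID leaf 0, which is a serializing instruction, and returns
 * the highest supported basic leaf (EAX). Returns 0 on non-x86 builds. */
uint32_t serialize_with_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
    return eax;
#else
    return 0;
#endif
}

/* Writes new instruction bytes through an alias mapping, then serializes
 * before the caller jumps to the modified code. */
void patch_then_serialize(uint8_t *alias, const uint8_t *bytes, size_t len)
{
    memcpy(alias, bytes, len);
    serialize_with_cpuid();
}
```

In the experiment above, the equivalent would be a `cpuid` placed between the `movw` and the `fun_ins` label, which should force the new bytes to be fetched regardless of which linear address was used for the write.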

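REAL update: I googled for "SMC Detection" (with quotes) and found some details on how modern Core2/Core iX CPUs detect SMC, as well as many errata lists in which Xeons and Pentiums hang in the SMC detector:

http://www.google.com/patents/US6237088 - System and method for tracking in-flight instructions in a pipeline @ 2001

DOI 10.1535/itj.1203.03 (google for it; there is a free version at citeseerx.ist.psu.edu) - "inclusion filtering" was added in Penryn to reduce the number of false SMC detections; the "existing inclusion detection mechanism" is pictured in Fig. 9

http://www.google.com/patents/US6405307 - an older patent on SMC detection logic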

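According to patent US6237088 (FIG. 5, summary), there is a "line address buffer" that holds many linear addresses, one per fetched instruction; in other words, a buffer full of fetched IPs with cache-line precision. Every store, or more exactly the "store address" phase of every store, is fed into a parallel comparator that checks whether the store overlaps any instruction currently in flight.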
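The comparison described for the line address buffer can be modeled in software as a toy. The sketch below is only an illustration of the idea: it assumes 64-byte cache lines, checks a single store address rather than a store that spans two lines, and uses a loop where the hardware would compare all entries in parallel. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_SHIFT = 6 };   /* assumes 64-byte cache lines */

/* Toy model of the line address buffer check: returns true if the store
 * address falls in the same cache line as any in-flight instruction. */
bool store_hits_inflight_code(const uintptr_t *inflight_ips, size_t n,
                              uintptr_t store_addr)
{
    for (size_t i = 0; i < n; i++) {
        if ((inflight_ips[i] >> LINE_SHIFT) == (store_addr >> LINE_SHIFT))
            return true;   /* would raise the SMC condition */
    }
    return false;
}
```

With in-flight instructions at 0x401000 and 0x401040, a store to 0x401047 matches the second line, while a store to 0x401080 matches neither.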

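Neither patent says clearly whether the SMC logic uses physical or logical addresses... The L1i in Sandy Bridge is VIPT (virtually indexed, physically tagged: the virtual address provides the index and the physical address is stored in the tag) according to http://nick-black.com/dankwiki/index.php/Sandy_Bridge, so the physical address is available by the time the L1 cache returns data. I think Intel may use physical addresses in the SMC detection logic.

Even more, http://www.google.com/patents/US6594734 @ 1999 (published in 2003; remember that a CPU design cycle is around 3-5 years) says in its "Summary" section that SMC detection now lives in the TLB and uses physical addresses (in other words: please don't try to fool the SMC detector):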

Self modifying code is detected using a translation lookaside buffer .. [which] has physical page addresses stored therein over which snoops can be performed using the physical memory address of a store into memory. ... To provide finer granularity than a page of addresses, FINE HIT bits are included with each entry in the cache associating information in the cache to portions of a page within memory.

(A portion of a page, called a quadrant in patent US6594734, sounds like a 1K subpage, doesn't it?)

Then they say:

Therefore snoops, triggered by store instructions into memory, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache with the address of all instructions stored within the associated page or pages of memory. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache and instruction pipeline are flushed by the retirement unit and new instructions are fetched from memory for storage into the instruction cache.

Because snoops for SMC detection are physical and the ITLB ordinarily accepts as an input a linear address to translate into a physical address, the ITLB is additionally formed as a content-addressable memory on the physical addresses and includes an additional input comparison port (referred to as a snoop port or reverse translation port)

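-- So, to detect SMC: if the snoop's physical address conflicts with a cache line held in the instruction buffer, the pipeline is restarted via an SMC signal delivered from the iTLB to the retirement unit. You can imagine how many CPU clocks are wasted in this snoop loop from the dTLB through the iTLB to the retirement unit (it cannot retire the next "nop" instruction, even though the nop executed earlier than the mov and has no side effects). But wow: the ITLB has a physical-address input and a second CAM (big and hot) just to support, and defend against, crazy and cheating self-modifying code.

PS: And what if we use huge pages (4M or maybe 1G)? The L1 TLB has huge-page entries, and there could be a lot of false SMC detections for 1/4 of a 4 MB page...

PPS: There is a theory that the erroneous handling of SMC with different linear addresses was present only in the early P6/PPro/P2...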

This article is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/17395557/
