
c++ - What is the difference between Sleep(0) and the PAUSE instruction in a busy loop?

Reposted. Author: 行者123. Updated: 2023-12-02 00:04:52

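I want to wait for an event in my application that should happen almost immediately, so I don't want to put my thread to sleep and wake it up later. I would like to know what the difference is between using Sleep(0) and the hardware PAUSE instruction.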

I cannot see any difference in CPU utilization for the program below. My question is not about power-saving considerations.

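#include <atomic>
#include <immintrin.h>  // _mm_pause()
#include <windows.h>    // Sleep()

// Flag that another thread is expected to set. It is atomic so the
// compiler cannot hoist the read out of the loop.
std::atomic<bool> t{false};

int main() {
    while (t.load() == false)
    {
        // PAUSE instruction. The original used MSVC inline assembly,
        // __asm { pause }, which only builds for 32-bit x86 targets.
        _mm_pause();
        //Sleep(0);
    }
}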

Best answer

Windows Sleep(0) vs. the PAUSE instruction

Let me quote the Intel 64 and IA-32 Architectures Optimization Reference Manual.

In multi-threading implementation, a popular construct in thread synchronization and for yielding scheduling quanta to another thread waiting to carry out its task is to sit in a loop and issuing SLEEP(0).

These are typically called “sleep loops” (see example #1). It should be noted that a SwitchToThread call can also be used. The “sleep loop” is common in locking algorithms and thread pools as the threads are waiting on work.

This construct of sitting in a tight loop and calling Sleep() service with a parameter of 0 is actually a polling loop with side effects:

  • Each call to Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles.
  • It also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.
  • When there is no other thread waiting to take possession of control, this sleep loop behaves to the OS as a highly active task demanding CPU resource, preventing the OS to put the CPU into a low-power state.

Example #1. Unoptimized sleep loop

while(!acquire_lock())
{ Sleep( 0 ); }
do_work();
release_lock();

Example #2. Power-friendly sleep loop using PAUSE

ATTEMPT_AGAIN:
if (!acquire_lock())
{   /* Spin on pause max_spin_count times before backing off to sleep */
    for (int j = 0; j < max_spin_count; ++j)
    {   /* intrinsic for PAUSE instruction */
        _mm_pause();
        if (read_volatile_lock())
        {
            if (acquire_lock()) goto PROTECTED_CODE;
        }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
}
PROTECTED_CODE:
do_work();
release_lock();

Example #2 shows the technique of using PAUSE instruction to make the sleep loop power friendly.

By slowing down the “spin-wait” with the PAUSE instruction, the multi-threading software gains:

  • Performance by facilitating the waiting tasks to acquire resources more easily from a busy wait.
  • Power-savings by both using fewer parts of the pipeline while spinning.
  • Elimination of great majority of unnecessarily executed instructions caused by the overhead of a Sleep(0) call.

In one case study, this technique achieved 4.3x of performance gain, which translated to 21% power savings at the processor and 13% power savings at platform level.
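
For readers who want something that compiles outside of Windows, the shape of Example #2 can be approximated in standard C++ roughly as follows. This is only a sketch under stated assumptions, not Intel's code: `SpinThenYieldLock`, `cpu_relax`, `run_counter_demo` and the value of `kMaxSpinCount` are hypothetical, and `std::this_thread::yield()` is used as a portable stand-in for `Sleep(0)`.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical helper: emit PAUSE on x86, otherwise fall back to a yield.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

// Hypothetical lock that follows the structure of Example #2:
// spin on PAUSE a bounded number of times, then give up the time slice.
class SpinThenYieldLock {
public:
    void lock() {
        for (;;) {
            if (try_acquire()) return;
            for (int j = 0; j < kMaxSpinCount; ++j) {
                cpu_relax();
                // Cheap read first; only attempt the atomic exchange
                // when the lock looks free.
                if (!locked_.load(std::memory_order_relaxed) && try_acquire())
                    return;
            }
            // Pause loop didn't work: portable stand-in for Sleep(0).
            std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    bool try_acquire() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }
    static constexpr int kMaxSpinCount = 1000;  // assumed value, tune per workload
    std::atomic<bool> locked_{false};
};

// Small demo: several threads increment a shared counter under the lock.
long run_counter_demo(int threads, int iterations) {
    SpinThenYieldLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` should return the full count of increments if the lock provides mutual exclusion. The read-before-exchange check mirrors the `read_volatile_lock()` step in Example #2 and keeps waiting threads from hammering the cache line with writes.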

PAUSE latency in the Skylake microarchitecture

The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is more beneficial to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows OS.

The PAUSE instruction is intended to:

  • Temporarily provide the sibling logical processor (ready to make forward progress exiting the spin loop) with competitively shared hardware resources. The competitively-shared microarchitectural resources that the sibling logical processor can utilize in the Skylake microarchitecture are: (1) More front end slots in the Decode ICache, LSD and IDQ; (2) More execution slots in the RS.
  • Save power consumed by the processor core compared to executing equivalent spin loop instruction sequence in the following configurations: (1) One logical processor is inactive (e.g. entering a C-state); (2) Both logical processors in the same core execute the PAUSE instruction; (3) HT is disabled (e.g. using BIOS options).

The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles.

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked on executing a fixed number of looped PAUSE instructions.

There's also a small power benefit in 2-core and 4-core systems. As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.
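
Because the cost of one PAUSE differs so much between microarchitectures, a loop that spins a fixed number of times waits for very different durations on different CPUs. One possible way around that, shown here purely as a sketch, is to bound the spin by elapsed time instead of by iteration count. The helper name `spin_wait_for` and the choice of `std::chrono::steady_clock` are assumptions made for illustration; they do not come from the Intel manual.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical helper: spin until `flag` becomes true or `budget` elapses.
// Returns true if the flag was observed set, false on timeout, in which
// case the caller would typically fall back to a blocking OS wait.
inline bool spin_wait_for(const std::atomic<bool>& flag,
                          std::chrono::nanoseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    while (!flag.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // time budget used up, regardless of PAUSE latency
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
        _mm_pause();
#else
        std::this_thread::yield();  // non-x86 fallback, an assumption
#endif
    }
    return true;
}
```

A caller might try `spin_wait_for(ready, std::chrono::microseconds(50))` and block on an OS synchronization object only if it returns false. Reading the clock on every iteration has a cost of its own, so a real implementation might check it only every few iterations.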

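You can find more information on this topic, along with code examples, in the Intel 64 and IA-32 Architectures Optimization Reference Manual and the Intel 64 and IA-32 Architectures Software Developer's Manual.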

My opinion

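It is better to structure the program logic so that neither Sleep(0) nor the PAUSE instruction is needed. In other words, avoid "spin-wait" loops altogether. Instead, use high-level synchronization functions such as WaitForMultipleObjects(), SetEvent(), and so on. These high-level synchronization functions are the best way to write such programs. If you analyze the available tools (for your configuration) in terms of performance, efficiency, and power saving, the higher-level functions are the best choice. Although they also incur expensive context switches and ring 3 to ring 0 transitions, these costs are infrequent and quite reasonable compared with the total you would spend on all the "spin-wait" PAUSE cycles combined, or on the cycles spent in Sleep(0).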
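
To make the idea concrete without depending on the Windows API, here is a minimal sketch of blocking on an event instead of polling. It uses `std::condition_variable` as a portable stand-in for the SetEvent() and WaitForSingleObject() pair; the `Event` class and `wait_for_result` function are hypothetical names introduced only for this illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical manual-reset event built on the standard library.
// set() plays the role of SetEvent(); wait() plays the role of
// WaitForSingleObject() with an infinite timeout.
class Event {
public:
    void set() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The thread sleeps inside the OS here; no CPU is burned polling.
        cv_.wait(lock, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Demo: a producer thread publishes a value and signals the event;
// the caller blocks until the signal arrives.
int wait_for_result() {
    Event ready;
    int result = 0;
    std::thread producer([&] {
        result = 42;   // publish the value first
        ready.set();   // then wake the waiter
    });
    ready.wait();
    producer.join();
    return result;
}
```

The waiting thread costs one context switch when it blocks and one when it wakes, which is the infrequent, reasonable expense the paragraph above refers to.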

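On a processor that supports hyper-threading, "spin-wait" loops can consume a significant portion of the processor's execution bandwidth. One logical processor executing a spin-wait loop can severely degrade the performance of the other logical processor. That is why disabling hyper-threading can sometimes improve performance, as some people have pointed out.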

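Continuously polling for device, file, or state changes in the program's workflow can make the computer consume more power, put pressure on memory and the bus, and cause unnecessary page faults. (Use the Task Manager in Windows to see which applications produce the most page faults while idle, waiting in the background for user input. These are the least efficient applications, because they rely on the polling described above.) Minimize polling, including spin loops, wherever possible, and use an event-driven approach and/or framework if one is available. This is the best practice that I strongly recommend. Your application should literally sleep all the time, waiting on multiple events that were set up in advance.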

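Nginx, originally written for UNIX-like operating systems, is a good example of an event-driven application. Since operating systems provide various functions and mechanisms for notifying your application, use those notifications instead of polling for device state changes. Just let your program sleep indefinitely until a notification or user input arrives. This technique removes the overhead of code polling the status of a data source, because the code receives notifications asynchronously when the status changes.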

Regarding "c++ - What is the difference between Sleep(0) and the PAUSE instruction in a busy loop?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7488196/
