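On recent CPUs (at least those from the last ten years or so), Intel provides three fixed-function hardware performance counters in addition to various configurable performance counters. The three fixed counters are: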
INST_RETIRED.ANY
CPU_CLK_UNHALTED.THREAD
CPU_CLK_UNHALTED.REF_TSC
This event counts the number of reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (e.g., P states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state and not in a TM stopclock state.
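I would expect this counter to match the free-running TSC value read by rdtsc, since the two should diverge only during halted cycles or in the "TM stop-clock state".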
for (int i = 0; i < 100; i++) {
    PFC_CNT cnt[7] = {};

    // Start the wall clock, the PMU counters, and the TSC reading back to back.
    int64_t start = nanos();
    PFCSTART(cnt);
    int64_t tsc = __rdtsc();

    busy_loop(CALIBRATION_LOOPS);

    // Stop the PMU counters, then take the TSC and wall-clock deltas.
    PFCEND(cnt);
    int64_t tsc_delta = __rdtsc() - tsc;
    int64_t nanos_delta = nanos() - start;

    // Counts per nanosecond times 1000 gives MHz.
    printf(CPU_W "d" REF_W ".2f" TSC_W ".2f" MHZ_W ".2f" RAT_W ".6f\n",
           sched_getcpu(),
           1000.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / nanos_delta,
           1000.0 * tsc_delta / nanos_delta,
           1000.0 * CALIBRATION_LOOPS / nanos_delta,
           1.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / tsc_delta);
}
The workload is busy_loop(CALIBRATION_LOOPS);, which is simply a tight loop of volatile stores that, as compiled by gcc and clang, executes at one cycle per iteration on recent hardware:
void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}
The PFCSTART and PFCEND calls use libpfc to read the CPU_CLK_UNHALTED.REF_TSC counter. __rdtsc() is an intrinsic that reads the TSC via the rdtsc instruction. Finally, we measure real time with nanos(), which is simply:
int64_t nanos() {
    auto t = std::chrono::high_resolution_clock::now();
    return std::chrono::time_point_cast<std::chrono::nanoseconds>(t).time_since_epoch().count();
}
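I do not issue a cpuid, and the measurements are not interleaved in an exact way, but the calibration loop runs for a full second, so such nanosecond-level issues are diluted to almost nothing.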
CPU# REF_TSC rdtsc Eff Mhz Ratio
0 2392.05 2591.76 2981.30 0.922946
0 2381.74 2591.79 3032.86 0.918955
0 2399.12 2591.79 3032.50 0.925660
0 2385.04 2591.79 3010.58 0.920230
0 2378.39 2591.79 3010.21 0.917663
0 2355.84 2591.77 2928.96 0.908970
0 2364.99 2591.79 2942.32 0.912492
0 2339.64 2591.77 2935.36 0.902720
0 2366.43 2591.79 3022.08 0.913049
0 2401.93 2591.79 3023.52 0.926747
0 2452.87 2591.78 3070.91 0.946400
0 2350.06 2591.79 2961.93 0.906733
0 2340.44 2591.79 2897.58 0.903020
0 2403.22 2591.79 2944.77 0.927246
0 2394.10 2591.79 3059.58 0.923723
0 2359.69 2591.78 2957.79 0.910449
0 2353.33 2591.79 2916.39 0.907992
0 2339.58 2591.79 2951.62 0.902690
0 2395.82 2591.79 3017.59 0.924389
0 2353.47 2591.79 2937.82 0.908047
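REF_TSC is the fixed TSC performance counter, and rdtsc is the result of the rdtsc instruction. Eff Mhz is the true effective CPU frequency computed over the interval, shown mostly out of curiosity and as a quick confirmation of how much turbo is kicking in. Ratio is the ratio of the REF_TSC and rdtsc columns. I would expect it to be very close to 1, but in practice it hovers around 0.90 to 0.92 with a lot of variance (I have seen it as low as 0.8 on other runs).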
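The columns in the table are simple rate computations, and a small sketch makes the arithmetic explicit. The function names below are illustrative and not part of the original benchmark; the formulas follow the printf arguments in the measurement loop.

```cpp
#include <cassert>
#include <cstdint>

// Events per nanosecond times 1000 gives MHz, as in the printf arguments
// of the measurement loop. (Name is hypothetical.)
double rate_mhz(std::int64_t events, std::int64_t nanos_delta) {
    assert(nanos_delta > 0);
    return 1000.0 * static_cast<double>(events) / static_cast<double>(nanos_delta);
}

// The "Ratio" column: REF_TSC rate divided by rdtsc rate.
double ref_to_tsc_ratio(double ref_tsc_mhz, double rdtsc_mhz) {
    assert(rdtsc_mhz > 0.0);
    return ref_tsc_mhz / rdtsc_mhz;
}

// Fraction of TSC-rate cycles that the REF_TSC counter did not count.
double missing_fraction(double ref_tsc_mhz, double rdtsc_mhz) {
    return 1.0 - ref_to_tsc_ratio(ref_tsc_mhz, rdtsc_mhz);
}
```

For the first row of the table, ref_to_tsc_ratio(2392.05, 2591.76) is about 0.9229, meaning roughly 7.7% of the TSC-rate cycles are absent from REF_TSC.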
The rdtsc calls return nearly exact results [1], while the PMU TSC counter is all over the place, sometimes dropping almost as low as 2300 MHz.
CPU# REF_TSC rdtsc Eff Mhz Ratio
0 2592.26 2592.25 2588.30 1.000000
0 2592.26 2592.26 2591.11 1.000000
0 2592.26 2592.26 2590.40 1.000000
0 2592.25 2592.25 2590.43 1.000000
0 2592.26 2592.26 2590.75 1.000000
0 2592.26 2592.26 2590.05 1.000000
0 2592.25 2592.25 2590.04 1.000000
0 2592.24 2592.24 2590.86 1.000000
0 2592.25 2592.25 2590.35 1.000000
0 2592.25 2592.25 2591.32 1.000000
0 2592.25 2592.25 2590.63 1.000000
0 2592.25 2592.25 2590.87 1.000000
0 2592.25 2592.25 2590.77 1.000000
0 2592.25 2592.25 2590.64 1.000000
0 2592.24 2592.24 2590.30 1.000000
0 2592.23 2592.23 2589.64 1.000000
0 2592.23 2592.23 2590.83 1.000000
0 2592.23 2592.23 2590.49 1.000000
0 2592.23 2592.23 2590.78 1.000000
0 2592.23 2592.23 2590.84 1.000000
0 2592.22 2592.22 2588.80 1.000000
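I am not executing any hlt or mwait instructions, and certainly nothing that would explain a variation of more than 10%. I cannot say for sure what "TM stop-clock cycles" are, but I am betting they are "thermal management stop-clock cycles", a trick used to temporarily throttle the CPU when it reaches its maximum temperature. However, I looked at the readings from the integrated thermistor and never saw the CPU exceed 60°C, far below the 90°C-100°C at which thermal management kicks in (I think).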
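2591.97 MHz, iteration after iteration. Then something changed, I am not sure what, and the rdtsc results began to show small variations of about 0.1%. One possibility is gradual clock adjustment, performed by the Linux timekeeping subsystem to bring the local crystal-derived time in line with the time determined by ntpd. Or perhaps it is simply crystal drift: the last graph above shows a steady increase in the measured period of rdtsc each second.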
Best answer
TL;DR
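The discrepancy you observe between RDTSC and REFTSC is caused by TurboBoost P-state transitions. During these transitions most of the core, including the fixed-function performance counter REF_TSC, is halted for approximately 20000-21000 cycles (8.5 us), but rdtsc continues to run at its invariant frequency. rdtsc probably sits in an isolated power and clock domain because it is so important and because of its documented wall-clock-like behavior.
The RDTSC-REFTSC discrepancy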
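The discrepancy shows up as a tendency for RDTSC to overcount relative to REFTSC. The longer the program runs, the more positive the RDTSC-REFTSC difference tends to become. Over very long stretches it can mount to 1%-2% or even higher.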
Of course, you have already observed yourself that the overcount goes away when TurboBoost is disabled, which with intel_pstate can be done as follows:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
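When scripting such measurements it can help to record whether turbo was actually disabled for a given run. The following is a minimal sketch that reads back the same sysfs file; it assumes the intel_pstate driver is in use, and the function names are hypothetical.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Interprets the text content of the no_turbo file.
// Returns 1 if turbo is disabled, 0 if it is enabled,
// and -1 if the text does not start with a number.
int parse_no_turbo(const std::string& text) {
    std::istringstream in(text);
    int value = 0;
    if (!(in >> value)) return -1;
    return value != 0 ? 1 : 0;
}

// Reads the sysfs file written by the echo command above.
// Returns -1 if the file cannot be opened (for example, when a
// different cpufreq driver is active).
int read_no_turbo(const std::string& path =
                      "/sys/devices/system/cpu/intel_pstate/no_turbo") {
    std::ifstream file(path);
    if (!file) return -1;
    std::stringstream buffer;
    buffer << file.rdbuf();
    return parse_no_turbo(buffer.str());
}
```

A benchmark could call read_no_turbo() once at startup and print the result next to its measurements, so that each table is labeled with the turbo setting it was taken under.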
one of the conditions under which REFTSC halts. TM2, on the other hand, does not gate the clock; it only scales the frequency.
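I modified libpfc() so that I could read select MSRs, specifically the IA32_PACKAGE_THERM_STATUS and IA32_THERM_STATUS MSRs. Both contain a read-only Status flag and a read-write, hardware-sticky Log flag for various thermal conditions: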
(The IA32_PACKAGE_THERM_STATUS register is substantially the same)
These conditions appear unrelated to the RDTSC overcount, which occurs reliably regardless of thermal status.
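- IA32_THREAD_STALL: counts the cycles stalled due to forced idling on this logical processor.
- MSR_CORE_HDC_RESIDENCY: same as above, but for the physical processor; counts cycles during which one or more logical processors of this core are force idled.
- MSR_PKG_HDC_SHALLOW_RESIDENCY: counts cycles during which the package is in the C2 state and at least one logical processor is force idled.
- MSR_PKG_HDC_DEEP_RESIDENCY: counts cycles during which the package is in a deeper C-state (exactly which one is configurable) and at least one logical processor is force idled.
MSR_CORE_PERF_LIMIT_REASONS. This register reports a large number of very useful Status and sticky Log bits: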
690H MSR_CORE_PERF_LIMIT_REASONS - Package - Indicator of Frequency Clipping in Processor Cores
- Bit 0: PROCHOT Status
- Bit 1: Thermal Status
- Bit 4: Graphics Driver Status. When set, frequency is reduced below the operating system request due to Processor Graphics driver override.
- Bit 5: Autonomous Utilization-Based Frequency Control Status. When set, frequency is reduced below the operating system request because the processor has detected that utilization is low.
- Bit 6: Voltage Regulator Thermal Alert Status. When set, frequency is reduced below the operating system request due to a thermal alert from the Voltage Regulator.
- Bit 8: Electrical Design Point Status. When set, frequency is reduced below the operating system request due to electrical design point constraints (e.g. maximum electrical current consumption).
- Bit 9: Core Power Limiting Status. When set, frequency is reduced below the operating system request due to domain-level power limiting.
- Bit 10: Package-Level Power Limiting PL1 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL1.
- Bit 11: Package-Level Power Limiting PL2 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL2.
- Bit 12: Max Turbo Limit Status. When set, frequency is reduced below the operating system request due to multi-core turbo limits.
- Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes.
- Bit 16: PROCHOT Log
- Bit 17: Thermal Log
- Bit 20: Graphics Driver Log
- Bit 21: Autonomous Utilization-Based Frequency Control Log
- Bit 22: Voltage Regulator Thermal Alert Log
- Bit 24: Electrical Design Point Log
- Bit 25: Core Power Limiting Log
- Bit 26: Package-Level Power Limiting PL1 Log
- Bit 27: Package-Level Power Limiting PL2 Log
- Bit 28: Max Turbo Limit Log
- Bit 29: Turbo Transition Attenuation Log
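The bit layout above is regular: each Log bit sits 16 positions above its Status bit. A small decoder can be sketched from that observation. The type and function names here are hypothetical and are not the ones used by pfc.ko or the demo; only the bit positions and names come from the list above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LimitReason {
    unsigned status_bit;  // the matching Log bit is status_bit + 16
    const char* name;
};

// Status bit positions taken from the MSR_CORE_PERF_LIMIT_REASONS list.
static const LimitReason kLimitReasons[] = {
    {0, "PROCHOT"},
    {1, "Thermal"},
    {4, "Graphics Driver"},
    {5, "Autonomous Utilization-Based Frequency Control"},
    {6, "Voltage Regulator Thermal Alert"},
    {8, "Electrical Design Point"},
    {9, "Core Power Limiting"},
    {10, "Package-Level Power Limiting PL1"},
    {11, "Package-Level Power Limiting PL2"},
    {12, "Max Turbo Limit"},
    {13, "Turbo Transition Attenuation"},
};

// Collects the names whose bit at (status_bit + offset) is set.
static std::vector<std::string> collect(std::uint64_t msr, unsigned offset) {
    std::vector<std::string> names;
    for (const LimitReason& r : kLimitReasons) {
        if ((msr >> (r.status_bit + offset)) & 1u) names.push_back(r.name);
    }
    return names;
}

// Reasons currently limiting frequency (Status bits).
std::vector<std::string> active_reasons(std::uint64_t msr) { return collect(msr, 0); }

// Reasons recorded since the sticky bits were last cleared (Log bits).
std::vector<std::string> logged_reasons(std::uint64_t msr) { return collect(msr, 16); }
```

Applied to the value 0x18001000, logged_reasons yields the PL2 and Max Turbo Limit entries, and active_reasons yields Max Turbo Limit alone.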
pfc.ko now supports this MSR, and the demo prints which of these log bits are active. The pfc.ko driver clears the sticky bits on every read.
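None of these bits correlated exactly with the presence of the RDTSC-REFTSC discrepancy, but the last bit gave me food for thought. The mere existence of Turbo Transition Attenuation implies that switching P-states has a substantial enough cost that it must be rate-limited with some hysteresis mechanism. When I could not find an MSR that counts these transitions, I decided to do the next best thing: use the magnitude of the RDTSC-REFTSC overcount to characterize the performance impact of a TurboBoost transition.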
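The intel_pstate driver is asked to keep package performance at no less than 98% and no more than 100%; this constrains the processor to oscillate between the second-highest and highest P-states (3.3 GHz and 3.4 GHz). I do this as follows: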
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
echo 0 > /sys/devices/system/cpu/cpu6/online
echo 0 > /sys/devices/system/cpu/cpu7/online
echo 98 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
1000, 1500, 2500, 4000, 6300,
10000, 15000, 25000, 40000, 63000,
100000, 150000, 250000, 400000, 630000,
1000000, 1500000, 2500000, 4000000, 6300000,
10000000, 15000000, 25000000, 40000000, 63000000
nanoseconds per add_calibration() call (multiply the numbers above by 2.4 to get the actual argument to add_calibration()).
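The multiplication by 2.4 can be written out as a tiny helper. This is only a sketch of the arithmetic stated above; the function name is hypothetical, and the interpretation of the result as a count for add_calibration() follows the text rather than the demo's source.

```cpp
#include <cstdint>

// Converts a nominal duration in nanoseconds into the argument passed to
// add_calibration(), using the factor 2.4 given in the text.
// Integer arithmetic (x * 12 / 5) avoids floating-point rounding.
std::uint64_t calibration_argument(std::uint64_t nominal_ns) {
    return nominal_ns * 12 / 5;
}
```

For example, the shortest setting of 1000 ns corresponds to an argument of 2400, and the longest setting of 63000000 ns to 151200000.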
CPU 0, measured CLK_REF_TSC MHz : 2392.56
CPU 0, measured rdtsc MHz : 2392.46
CPU 0, measured add MHz : 3286.30
CPU 0, measured XREF_CLK time (s) : 0.00018200
CPU 0, measured delta time (s) : 0.00018258
CPU 0, measured tsc_delta time (s) : 0.00018200
CPU 0, ratio ref_tsc :ref_xclk : 24.00131868
CPU 0, ratio ref_core:ref_xclk : 33.00071429
CPU 0, ratio rdtsc :ref_xclk : 24.00032967
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -18
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2392.63
CPU 0, measured rdtsc MHz : 2392.62
CPU 0, measured add MHz : 3288.03
CPU 0, measured XREF_CLK time (s) : 0.00018192
CPU 0, measured delta time (s) : 0.00018248
CPU 0, measured tsc_delta time (s) : 0.00018192
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 32.99983509
CPU 0, ratio rdtsc :ref_xclk : 23.99989006
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2284.69
CPU 0, measured rdtsc MHz : 2392.63
CPU 0, measured add MHz : 3151.99
CPU 0, measured XREF_CLK time (s) : 0.00018121
CPU 0, measured delta time (s) : 0.00019036
CPU 0, measured tsc_delta time (s) : 0.00018977
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 33.38540919
CPU 0, ratio rdtsc :ref_xclk : 25.13393301
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : 20548
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018000000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2392.46
CPU 0, measured rdtsc MHz : 2392.45
CPU 0, measured add MHz : 3287.80
CPU 0, measured XREF_CLK time (s) : 0.00018192
CPU 0, measured delta time (s) : 0.00018249
CPU 0, measured tsc_delta time (s) : 0.00018192
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 32.99978012
CPU 0, ratio rdtsc :ref_xclk : 23.99989006
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
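The reported overcount can be cross-checked against the other fields of each sample. The sketch below assumes that "measured XREF_CLK time" is the XCLK cycle count divided by the 100 MHz REF_XCLK rate, and that the two ratios are per-XCLK cycle counts; both are readings of the output format, not documented facts, and the function name is hypothetical.

```cpp
#include <cassert>

// Overcount = (rdtsc cycles per XCLK - REF_TSC cycles per XCLK) * XCLK cycles,
// where XCLK cycles = measured XREF_CLK time * XCLK frequency.
double overcount_cycles(double rdtsc_per_xclk,
                        double reftsc_per_xclk,
                        double xref_clk_seconds,
                        double xclk_hz = 100e6) {
    assert(xref_clk_seconds >= 0.0 && xclk_hz > 0.0);
    const double xclk_cycles = xref_clk_seconds * xclk_hz;
    return (rdtsc_per_xclk - reftsc_per_xclk) * xclk_cycles;
}
```

With the third sample's values (25.13393301, 24.00000000, 0.00018121 s) this gives about 20548 cycles, and with the first sample's values (24.00032967, 24.00131868, 0.00018200 s) about -18, which agrees with the "rdtsc-reftsc overcount" lines printed for those samples.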
Saturated Blue Dots: 0 standard deviations (close to mean)
Saturated Red Dots: +3 standard deviations (above mean)
Saturated Green Dots: -3 standard deviations (below mean)
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,-4,3639,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-44,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,12,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,32,3171,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
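RDTSC counts in sync with REFTSC, at 24 times the rate of REF_XCLK (100 MHz), with negligible overcount, typically 0 cycles spent in the kernel and therefore 0 transitions into the kernel. Kernel interrupts take approximately 3000 reference cycles to service.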
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,2,0,0
24.00,33.00,24.00,22,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.05,25.11,20396,0,0
24.00,33.38,25.12,20212,0,0
24.00,33.39,25.12,20308,0,0
24.00,33.42,25.12,20296,0,0
24.00,33.43,25.11,20158,0,0
24.00,33.43,25.11,20178,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.00,24.00,20,3920,1
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.44,25.13,20396,0,0
24.00,33.46,25.11,20156,0,0
24.00,33.46,25.12,20268,0,0
24.00,33.41,25.12,20322,0,0
24.00,33.40,25.11,20216,0,0
24.00,33.46,25.12,20168,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,22,0,0
24.00,33.75,24.45,20166,0,0
24.00,33.78,24.45,20302,0,0
24.00,33.78,24.45,20202,0,0
24.00,33.68,24.91,41082,0,0
24.00,33.31,24.90,40998,0,0
24.00,33.70,25.30,58986,3668,1
24.00,33.74,24.42,18798,0,0
24.00,33.74,24.45,20172,0,0
24.00,33.77,24.45,20156,0,0
24.00,33.78,24.45,20258,0,0
24.00,33.78,24.45,20240,0,0
24.00,33.77,24.42,18826,0,0
24.00,33.75,24.45,20372,0,0
24.00,33.76,24.42,18798,4081,1
24.00,33.74,24.41,18460,0,0
24.00,33.75,24.45,20234,0,0
24.00,33.77,24.45,20284,0,0
24.00,33.78,24.45,20150,0,0
24.00,33.78,24.45,20314,0,0
24.00,33.78,24.42,18766,0,0
24.00,33.71,25.36,61608,0,0
24.00,33.76,24.45,20336,0,0
24.00,33.78,24.45,20234,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.00,24.00,-10,0,0
24.00,33.00,24.00,4,0,0
24.00,33.00,24.00,18,0,0
24.00,33.00,24.00,2,4132,1
24.00,33.00,24.00,44,0,0
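TurboBoost is responsible for the RDTSC-REFTSC discrepancy. This discrepancy can be used to determine that a TurboBoost state transition from 3.3 GHz to 3.4 GHz takes approximately 20500 reference clock cycles (8.5 us), and that it is triggered no later than about 250000 ns (250 us; 600000 reference clock cycles) after entry into add_reference(), when the processor decides that the workload is intense enough to warrant a frequency-voltage scaling.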
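The unit conversions in this conclusion, and the way the overcount column clusters near multiples of about 20500, can be sketched as follows. The 2400 MHz reference rate is inferred from the 24x ratio to the 100 MHz REF_XCLK; the function names and the rounding heuristic are illustrative assumptions, not part of the original experiment.

```cpp
#include <cassert>
#include <cmath>

// Reference cycles -> microseconds at the given reference rate in MHz.
double ref_cycles_to_us(double cycles, double ref_mhz = 2400.0) {
    assert(ref_mhz > 0.0);
    return cycles / ref_mhz;
}

// Nanoseconds -> reference cycles at the given reference rate in MHz.
double ns_to_ref_cycles(double ns, double ref_mhz = 2400.0) {
    return ns * ref_mhz / 1000.0;
}

// Rough estimate of how many P-state transitions a sample contains,
// assuming each transition costs about 20500 reference cycles.
long estimated_transitions(double overcount_cycles,
                           double cycles_per_transition = 20500.0) {
    assert(cycles_per_transition > 0.0);
    return std::lround(overcount_cycles / cycles_per_transition);
}
```

Under these assumptions, 20500 cycles is about 8.5 us, 250000 ns is 600000 cycles, and overcounts such as 20396, 41082, and 61608 from the data above map to one, two, and three transitions respectively.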
Regarding "performance - Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45472147/