gpt4 book ai didi

c - 在两个相同的Skylake Xeon Gold 6154系统上测得的不同的内核间延迟

转载 作者:行者123 更新时间:2023-12-01 23:27:06 26 4
gpt4 key购买 nike

我们一直在使用两个完全相同的软件(Centos 7 OS和BIOS设置)使用相同的Skylake服务器。除延迟性能外,其他所有内容都相同。我们的软件正在使用AVX512。

在测试中,我注意到AVX512每次都会降低其中一个系统的性能(增加延迟)。有明显的性能差异。我检查了所有内容,都一样。

我该怎么做才能解决这个问题?哪个工具可以提供帮助?

提前致谢..

sudo lshw -class cpu
*-cpu:0
description: CPU
product: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
vendor: Intel Corp.
vendor_id: GenuineIntel
physical id: 400
bus info: cpu@0
version: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
slot: CPU1
size: 3GHz
capacity: 4GHz
width: 64 bits
clock: 1010MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d
configuration: cores=18 enabledcores=18 threads=18
*-cpu:1 DISABLED
description: CPU [empty]
physical id: 401
slot: CPU2

更新:在Peter发表评论之后,我添加了以下示例代码作为示例。
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64

/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;

zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}

#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;

__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));

union levels g_shared;

void *worker_loop(void *param)
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(16, &cpuset);

pthread_t thread = pthread_self();

pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

union levels lshared;
uint32_t old_x1 = 0;
lshared.x1 = 0;
while (1) {
__asm__ ("" ::: "memory");

lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);

if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
// printf("%u %u %u %u %u %u\n", lshared.x1, lshared.x3, lshared.x4, lshared.x5, lshared.x6, lshared.x7);
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
if (val > lshared.x2) {
printf("> (%u) %lu - %lu = %lu\n", lshared.x1, val, lshared.x2, val - lshared.x2);
} else {
printf("< (%u) %lu - %lu = %lu\n", lshared.x1, lshared.x2, val, lshared.x2 - val);
}
}
old_x1 = lshared.x1;

_mm_pause();
}

return NULL;
}

int main(int argc, char *argv[])
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(15, &cpuset);

pthread_t thread = pthread_self();

memset(&g_shared, 0, sizeof(g_shared));

pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);

uint32_t val = 1;
union levels lshared;

while (1) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");

usleep(100000);

val++;

_mm_pause();
}

return EXIT_SUCCESS;
}

较慢的系统的输出:
> (1) 4582365777844442 - 4582365777792564 = 51878
> (2) 4582366077239290 - 4582366077238806 = 484
> (3) 4582366376674782 - 4582366376674346 = 436
> (4) 4582366676044526 - 4582366676041890 = 2636
> (5) 4582366975470562 - 4582366975470134 = 428
> (6) 4582367274899258 - 4582367274898828 = 430
> (7) 4582367574328446 - 4582367574328022 = 424
> (8) 4582367873757956 - 4582367873757532 = 424
> (9) 4582368173187886 - 4582368173187466 = 420
> (10) 4582368472618418 - 4582368472617958 = 460
> (11) 4582368772049720 - 4582368772049236 = 484
> (12) 4582369071481018 - 4582369071480594 = 424
> (13) 4582369370912760 - 4582369370912284 = 476
> (14) 4582369670344890 - 4582369670344212 = 678
> (15) 4582369969776826 - 4582369969776400 = 426
> (16) 4582370269209462 - 4582370269209024 = 438
> (17) 4582370568642626 - 4582370568642172 = 454
> (18) 4582370868076202 - 4582370868075764 = 438
> (19) 4582371167510016 - 4582371167509594 = 422
> (20) 4582371466944326 - 4582371466943892 = 434
> (21) 4582371766379206 - 4582371766378734 = 472
> (22) 4582372065814804 - 4582372065814344 = 460
> (23) 4582372365225608 - 4582372365223068 = 2540
> (24) 4582372664652112 - 4582372664651668 = 444
> (25) 4582372964080746 - 4582372964080314 = 432
> (26) 4582373263510732 - 4582373263510308 = 424
> (27) 4582373562940116 - 4582373562939676 = 440
> (28) 4582373862370284 - 4582373862369860 = 424
> (29) 4582374161800632 - 4582374161800182 = 450

更快的系统输出:
> (1) 9222001841102298 - 9222001841045386 = 56912
> (2) 9222002140513228 - 9222002140512908 = 320
> (3) 9222002439970702 - 9222002439970330 = 372
> (4) 9222002739428448 - 9222002739428114 = 334
> (5) 9222003038886492 - 9222003038886152 = 340
> (6) 9222003338344884 - 9222003338344516 = 368
> (7) 9222003637803702 - 9222003637803332 = 370
> (8) 9222003937262776 - 9222003937262404 = 372
> (9) 9222004236649320 - 9222004236648932 = 388
> (10) 9222004536101876 - 9222004536101510 = 366
> (11) 9222004835554776 - 9222004835554378 = 398
> (12) 9222005135008064 - 9222005135007686 = 378
> (13) 9222005434461868 - 9222005434461526 = 342
> (14) 9222005733916416 - 9222005733916026 = 390
> (15) 9222006033370968 - 9222006033370640 = 328
> (16) 9222006332825872 - 9222006332825484 = 388
> (17) 9222006632280956 - 9222006632280570 = 386
> (18) 9222006931736548 - 9222006931736178 = 370
> (19) 9222007231192376 - 9222007231191986 = 390
> (20) 9222007530648868 - 9222007530648486 = 382
> (21) 9222007830105642 - 9222007830105270 = 372
> (22) 9222008129562750 - 9222008129562382 = 368
> (23) 9222008429020310 - 9222008429019944 = 366
> (24) 9222008728478336 - 9222008728477970 = 366
> (25) 9222009027936696 - 9222009027936298 = 398
> (26) 9222009327395716 - 9222009327395342 = 374
> (27) 9222009626854876 - 9222009626854506 = 370
> (28) 9222009926282324 - 9222009926281936 = 388
> (29) 9222010225734832 - 9222010225734442 = 390
> (30) 9222010525187748 - 9222010525187366 = 382

更新2:在Peter回答之后,我添加了以下示例代码作为示例,以测量同一裸片上不同网状网络路径的延迟,并且答案的内容是正确的,不同的CPU具有不同的CPU间延迟。但是,在所有情况下,同一系统中的一个仍然比另一个系统慢25%。

另外我也不知道它是否会影响它,但是我才意识到慢速的CPU有额外的 md_clear 标志。

总之,我该怎么做才能解决这个问题?哪个工具可以提供帮助?我如何理解性能差异?
#include <emmintrin.h>
#include <pthread.h>
#include <immintrin.h>
#include <unistd.h>
#include <inttypes.h>
#include <string.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64

/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;

zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}

#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

static inline uint64_t rdtsc(void)
{
union {
uint64_t tsc_64;
__extension__
struct {
uint32_t lo_32;
uint32_t hi_32;
};
} tsc;

__asm__ volatile("rdtsc" :
"=a" (tsc.lo_32),
"=d" (tsc.hi_32));
return tsc.tsc_64;
}
union levels {
__m512i zmm0;
struct {
uint32_t x1;
uint64_t x2;
uint64_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
};
} __attribute__((aligned(CACHE_LINE_SIZE)));

union levels g_shared;

uint32_t g_main_cpu;
uint32_t g_worker_cpu;

void *worker_loop(void *param)
{
_mm_mfence();

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_worker_cpu, &cpuset);

pthread_t thread = pthread_self();

pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

union levels lshared;
uint32_t old_x1 = 1;

uint64_t min = 10000, max = 0, sum = 0;

int i = 0;
while (i < 300) {
__asm__ ("" ::: "memory");
lshared.zmm0 = _mm512_load_si512((const void *)&g_shared);

if (unlikely(lshared.x1 <= old_x1)) {
continue;
} else if (unlikely(lshared.x1 != lshared.x7)) {
exit(EXIT_FAILURE);
} else {
uint64_t val = rdtsc();
uint64_t diff = val - lshared.x2;
sum += diff;
if (min > diff)
min = diff;

if (diff > max)
max = diff;

i++;
}
old_x1 = lshared.x1;

_mm_pause();
}

printf("(M=%u-W=%u) min=%lu max=%lu mean=%lu\n", g_main_cpu, g_worker_cpu, min, max, sum / 300);

return NULL;
}

int main(int argc, char *argv[])
{
for (int main_cpu = 2; main_cpu <= 17; ++main_cpu) {
for (int worker_cpu = 2; worker_cpu <= 17; ++worker_cpu) {
if (main_cpu == worker_cpu) {
continue;
}
_mm_mfence();

g_main_cpu = main_cpu;
g_worker_cpu = worker_cpu;

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(g_main_cpu, &cpuset);

pthread_t thread = pthread_self();

memset(&g_shared, 0, sizeof(g_shared));

pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

pthread_t worker;
pthread_create(&worker, NULL, worker_loop, NULL);

uint32_t val = 0;
union levels lshared;

for (int i = 0; i < 350; ++i) {
lshared.x1 = val;
lshared.x2 = rdtsc();
lshared.x3 = val;
lshared.x4 = val;
lshared.x5 = val;
lshared.x6 = val;
lshared.x7 = val;
_mm512_store_si512((void *)&g_shared, lshared.zmm0);
__asm__ ("" ::: "memory");

usleep(100000);

val++;

_mm_pause();
}

pthread_join(worker, NULL);
}
}

return EXIT_SUCCESS;
}

两个系统的输出:(2-17是隔离的cpus)
            slow cpu    fast cpu
------------------------------------

(M=2-W=3) mean=580 mean=374
(M=2-W=4) mean=463 mean=365
(M=2-W=5) mean=449 mean=391
(M=2-W=6) mean=484 mean=345
(M=2-W=7) mean=430 mean=386
(M=2-W=8) mean=439 mean=369
(M=2-W=9) mean=445 mean=376
(M=2-W=10) mean=480 mean=354
(M=2-W=11) mean=440 mean=392
(M=2-W=12) mean=475 mean=324
(M=2-W=13) mean=453 mean=373
(M=2-W=14) mean=474 mean=344
(M=2-W=15) mean=445 mean=384
(M=2-W=16) mean=468 mean=372
(M=2-W=17) mean=462 mean=373
(M=3-W=2) mean=447 mean=392
(M=3-W=4) mean=556 mean=386
(M=3-W=5) mean=418 mean=409
(M=3-W=6) mean=473 mean=372
(M=3-W=7) mean=397 mean=400
(M=3-W=8) mean=408 mean=403
(M=3-W=9) mean=412 mean=413
(M=3-W=10) mean=447 mean=389
(M=3-W=11) mean=412 mean=423
(M=3-W=12) mean=446 mean=399
(M=3-W=13) mean=427 mean=407
(M=3-W=14) mean=445 mean=390
(M=3-W=15) mean=417 mean=448
(M=3-W=16) mean=438 mean=386
(M=3-W=17) mean=435 mean=396
(M=4-W=2) mean=463 mean=368
(M=4-W=3) mean=433 mean=401
(M=4-W=5) mean=561 mean=406
(M=4-W=6) mean=468 mean=378
(M=4-W=7) mean=416 mean=387
(M=4-W=8) mean=425 mean=386
(M=4-W=9) mean=425 mean=415
(M=4-W=10) mean=464 mean=379
(M=4-W=11) mean=424 mean=404
(M=4-W=12) mean=456 mean=369
(M=4-W=13) mean=441 mean=395
(M=4-W=14) mean=460 mean=378
(M=4-W=15) mean=427 mean=405
(M=4-W=16) mean=446 mean=369
(M=4-W=17) mean=448 mean=391
(M=5-W=2) mean=447 mean=382
(M=5-W=3) mean=418 mean=406
(M=5-W=4) mean=430 mean=397
(M=5-W=6) mean=584 mean=386
(M=5-W=7) mean=399 mean=399
(M=5-W=8) mean=404 mean=386
(M=5-W=9) mean=408 mean=408
(M=5-W=10) mean=446 mean=378
(M=5-W=11) mean=411 mean=407
(M=5-W=12) mean=440 mean=385
(M=5-W=13) mean=424 mean=402
(M=5-W=14) mean=442 mean=381
(M=5-W=15) mean=411 mean=411
(M=5-W=16) mean=433 mean=398
(M=5-W=17) mean=429 mean=395
(M=6-W=2) mean=486 mean=356
(M=6-W=3) mean=453 mean=388
(M=6-W=4) mean=471 mean=353
(M=6-W=5) mean=452 mean=388
(M=6-W=7) mean=570 mean=360
(M=6-W=8) mean=444 mean=377
(M=6-W=9) mean=450 mean=376
(M=6-W=10) mean=485 mean=335
(M=6-W=11) mean=451 mean=410
(M=6-W=12) mean=479 mean=353
(M=6-W=13) mean=463 mean=363
(M=6-W=14) mean=479 mean=359
(M=6-W=15) mean=450 mean=394
(M=6-W=16) mean=473 mean=364
(M=6-W=17) mean=469 mean=373
(M=7-W=2) mean=454 mean=365
(M=7-W=3) mean=418 mean=410
(M=7-W=4) mean=443 mean=370
(M=7-W=5) mean=421 mean=407
(M=7-W=6) mean=456 mean=363
(M=7-W=8) mean=527 mean=380
(M=7-W=9) mean=417 mean=392
(M=7-W=10) mean=460 mean=361
(M=7-W=11) mean=421 mean=402
(M=7-W=12) mean=447 mean=354
(M=7-W=13) mean=430 mean=381
(M=7-W=14) mean=449 mean=375
(M=7-W=15) mean=420 mean=393
(M=7-W=16) mean=442 mean=352
(M=7-W=17) mean=438 mean=367
(M=8-W=2) mean=463 mean=382
(M=8-W=3) mean=434 mean=411
(M=8-W=4) mean=452 mean=372
(M=8-W=5) mean=429 mean=402
(M=8-W=6) mean=469 mean=368
(M=8-W=7) mean=416 mean=418
(M=8-W=9) mean=560 mean=418
(M=8-W=10) mean=468 mean=385
(M=8-W=11) mean=429 mean=394
(M=8-W=12) mean=460 mean=378
(M=8-W=13) mean=439 mean=392
(M=8-W=14) mean=459 mean=373
(M=8-W=15) mean=429 mean=383
(M=8-W=16) mean=452 mean=376
(M=8-W=17) mean=449 mean=401
(M=9-W=2) mean=440 mean=368
(M=9-W=3) mean=410 mean=398
(M=9-W=4) mean=426 mean=385
(M=9-W=5) mean=406 mean=403
(M=9-W=6) mean=447 mean=378
(M=9-W=7) mean=393 mean=427
(M=9-W=8) mean=408 mean=368
(M=9-W=10) mean=580 mean=392
(M=9-W=11) mean=408 mean=387
(M=9-W=12) mean=433 mean=381
(M=9-W=13) mean=418 mean=444
(M=9-W=14) mean=441 mean=407
(M=9-W=15) mean=408 mean=401
(M=9-W=16) mean=427 mean=376
(M=9-W=17) mean=426 mean=383
(M=10-W=2) mean=478 mean=361
(M=10-W=3) mean=446 mean=379
(M=10-W=4) mean=461 mean=350
(M=10-W=5) mean=445 mean=373
(M=10-W=6) mean=483 mean=354
(M=10-W=7) mean=428 mean=370
(M=10-W=8) mean=436 mean=355
(M=10-W=9) mean=448 mean=390
(M=10-W=11) mean=569 mean=350
(M=10-W=12) mean=473 mean=337
(M=10-W=13) mean=454 mean=370
(M=10-W=14) mean=474 mean=360
(M=10-W=15) mean=441 mean=370
(M=10-W=16) mean=463 mean=354
(M=10-W=17) mean=462 mean=358
(M=11-W=2) mean=447 mean=384
(M=11-W=3) mean=411 mean=408
(M=11-W=4) mean=433 mean=394
(M=11-W=5) mean=413 mean=428
(M=11-W=6) mean=455 mean=383
(M=11-W=7) mean=402 mean=395
(M=11-W=8) mean=407 mean=418
(M=11-W=9) mean=417 mean=424
(M=11-W=10) mean=452 mean=395
(M=11-W=12) mean=577 mean=406
(M=11-W=13) mean=426 mean=402
(M=11-W=14) mean=442 mean=412
(M=11-W=15) mean=408 mean=411
(M=11-W=16) mean=435 mean=400
(M=11-W=17) mean=431 mean=415
(M=12-W=2) mean=473 mean=352
(M=12-W=3) mean=447 mean=381
(M=12-W=4) mean=461 mean=361
(M=12-W=5) mean=445 mean=366
(M=12-W=6) mean=483 mean=322
(M=12-W=7) mean=431 mean=358
(M=12-W=8) mean=438 mean=340
(M=12-W=9) mean=448 mean=409
(M=12-W=10) mean=481 mean=334
(M=12-W=11) mean=447 mean=351
(M=12-W=13) mean=580 mean=383
(M=12-W=14) mean=473 mean=359
(M=12-W=15) mean=441 mean=385
(M=12-W=16) mean=463 mean=355
(M=12-W=17) mean=462 mean=358
(M=13-W=2) mean=450 mean=385
(M=13-W=3) mean=420 mean=410
(M=13-W=4) mean=440 mean=396
(M=13-W=5) mean=418 mean=402
(M=13-W=6) mean=461 mean=385
(M=13-W=7) mean=406 mean=391
(M=13-W=8) mean=415 mean=382
(M=13-W=9) mean=421 mean=402
(M=13-W=10) mean=457 mean=376
(M=13-W=11) mean=422 mean=409
(M=13-W=12) mean=451 mean=381
(M=13-W=14) mean=579 mean=375
(M=13-W=15) mean=430 mean=402
(M=13-W=16) mean=440 mean=408
(M=13-W=17) mean=439 mean=394
(M=14-W=2) mean=477 mean=330
(M=14-W=3) mean=449 mean=406
(M=14-W=4) mean=464 mean=355
(M=14-W=5) mean=450 mean=389
(M=14-W=6) mean=487 mean=342
(M=14-W=7) mean=432 mean=380
(M=14-W=8) mean=439 mean=360
(M=14-W=9) mean=451 mean=405
(M=14-W=10) mean=485 mean=356
(M=14-W=11) mean=447 mean=398
(M=14-W=12) mean=479 mean=338
(M=14-W=13) mean=455 mean=382
(M=14-W=15) mean=564 mean=383
(M=14-W=16) mean=481 mean=361
(M=14-W=17) mean=465 mean=351
(M=15-W=2) mean=426 mean=409
(M=15-W=3) mean=395 mean=424
(M=15-W=4) mean=412 mean=427
(M=15-W=5) mean=395 mean=425
(M=15-W=6) mean=435 mean=391
(M=15-W=7) mean=379 mean=405
(M=15-W=8) mean=388 mean=412
(M=15-W=9) mean=399 mean=432
(M=15-W=10) mean=432 mean=389
(M=15-W=11) mean=397 mean=432
(M=15-W=12) mean=426 mean=393
(M=15-W=13) mean=404 mean=407
(M=15-W=14) mean=429 mean=412
(M=15-W=16) mean=539 mean=391
(M=15-W=17) mean=414 mean=397
(M=16-W=2) mean=456 mean=368
(M=16-W=3) mean=422 mean=406
(M=16-W=4) mean=445 mean=384
(M=16-W=5) mean=427 mean=397
(M=16-W=6) mean=462 mean=348
(M=16-W=7) mean=413 mean=408
(M=16-W=8) mean=419 mean=361
(M=16-W=9) mean=429 mean=385
(M=16-W=10) mean=463 mean=369
(M=16-W=11) mean=426 mean=404
(M=16-W=12) mean=454 mean=391
(M=16-W=13) mean=434 mean=378
(M=16-W=14) mean=454 mean=412
(M=16-W=15) mean=424 mean=416
(M=16-W=17) mean=578 mean=378
(M=17-W=2) mean=460 mean=402
(M=17-W=3) mean=419 mean=381
(M=17-W=4) mean=446 mean=394
(M=17-W=5) mean=424 mean=422
(M=17-W=6) mean=468 mean=369
(M=17-W=7) mean=409 mean=401
(M=17-W=8) mean=418 mean=405
(M=17-W=9) mean=428 mean=414
(M=17-W=10) mean=459 mean=369
(M=17-W=11) mean=424 mean=387
(M=17-W=12) mean=451 mean=372
(M=17-W=13) mean=435 mean=382
(M=17-W=14) mean=459 mean=369
(M=17-W=15) mean=426 mean=401
(M=17-W=16) mean=446 mean=371

最佳答案

我的猜测:不同的Xeon Gold 6154芯片(18c 36t)具有不同的内核,它们融合在上以防止缺陷,因此您在固定到的两个内核之间和/或缓存的L3缓存 slice 之间具有不同的网状网络路径。线最终被映射到。这会影响这两个内核之间的内核间延迟。

根据Wikichip的说法,它是based on the "Extreme Core Count die" for SKX,上面有28个物理核,Xeon Platinum 8176的核数基于同一晶粒。

因此,在您的芯片上禁用了10个内核,但可能禁用了10个。这是否意味着某些内核彼此之间的跳数更大(也许)?和/或这可能意味着将以不同的顺序枚举核心,因此相同的硬编码核心编号意味着不同的网格位置。

https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture

您的更新显示了所有成对内核的新数据。 似乎大多数(但不是全部)配对的CPU速度较慢。 (尽管我不完全信任该数据,如果您使用均值而不丢弃异常值。)这仍然可以通过不同的网格布局来合理地解释,大多数核心之间的距离可能更差。

这是一个2D网格,大概反映了核心的物理布局。也许快速的CPU大多在外部禁用了核心,因此 Activity 的核心被密集地封装在较小的网格中。但是,也许速度较慢的一个在网格的更多“内部”核心中存在缺陷。

我刚刚意识到慢速CPU具有额外的md_clear CPU功能标志。

根据https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling的指示,md_clear标志指示微码支持通过verw指令等进行L1TF /微体系结构数据采样的解决方法。

也许较新的微代码版本还进行了另一项更改,从而损害了该微基准测试的性能(也许总体而言)。也许这是一个巧合。

来自更多具有较新微码的Xeon Gold CPU的更多数据可能会有所启发。如果即使在使用相同的微代码的情况下,我们仍然看到CPU之间的差异如此之大,那将支持我的假设,即物理内核被融合为28核芯片作为18个工作核CPU出售的结果。

同样在基于较小裸片的Xeon上进行测试(例如启用所有14个内核的14核HCC裸片)可能会显示更好的最坏情况对内核间延迟。除非网状时钟与参考核心时钟成比例,否则可能需要控制不同的RDTSC,Turbo和非核心频率。

该解释完全不依赖于AVX512。在标量负载下,您是否看到相同的效果?

另外,可能会有一个小的时序差异,而没有_mm_pause的时序差异会比另一时序的效果差;也许一个核心正在看到管道核对(machine_clears.memory_ordering perf事件),而另一个不是吗?

您使用_mm_pause()进行的更新通常会排除放大实际延迟中的微小差异的情况。无论是什么原因,两者之间的差异似乎都很大。

您的CPU足够新,可以安全地假设TSC在内核之间进行了同步,并且大概两个内核都已经以max turbo运行。 (命名的CPU功能之一constant_tscinvariant_tsc明确保证了这一点,但我忘记了哪一个。另一项意味着无论内核时钟频率如何它都会以固定的参考频率进行计时。nonstop_tsc表示当内核处于内核状态时它不会停止睡着了。)

(TL:DR:我认为您的微基准测试看起来不错,并且您正在以合理的方式测量内核间的等待时间,而没有巨大的测量误差。)


我该怎么做才能解决这个问题?

你不能

如果低内核间等待时间对于一个应用程序至关重要,请尝试使用几个不同的CPU,直到发现延迟时间低于平均水平的CPU。

在Xeons上运行其他应用程序时延迟更短。

或者,如果我的假设是正确的,也许可以根据“高核心数”芯片获得14核Xeon Gold。启用所有14个内核后,最好的情况是。但是那些Xeons只有1个AVX512 FMA单元。

哪个工具可以提供帮助?

如果只有几个线程需要紧密耦合,请在您拥有的CPU上找到彼此之间具有最低延迟的物理内核集群。将对延迟最敏感的线程固定到这些内核。

如果这适用于您的应用程序,则可以考虑基于4个物理核心的CCX单元的Zen或Zen2微体系结构,该群集内部的延迟低,但跨群集的延迟明显更差。 AMD确实有一些多核芯片,但是只有Zen2在其加载/存储和执行单元中具有完整的256位SIMD宽度。 (它仍然不能使用AVX512,但是如果您的应用程序可以大量使用SIMD,那么您可能至少需要全速AVX2 + FMA)。

我如何理解性能差异?

如果我的假设正确,那是制造和销售的CPU的固有属性。英特尔设计了带有n物理内核的芯片。如果制造缺陷破坏了其中一些核心,他们仍可以将其作为核心数量较少的SKU出售。 (它们会烧掉物理保险丝,因此禁用的内核不会浪费电源)。大概它的网格节点仍然必须工作,除非它们可以短路经过整个节点以加强网格?

当产量高于他们想要出售的价格最高的核心数量SKU的需求时,它们将禁用某些工作的核心以及芯片上有缺陷的核心。但这通常是通过激光熔断器实现的,而不仅仅是像旧GPU中那样的固件,您有时可以破解固件以激活禁用的内核。因此,您实际上无能为力。

购买具有所有裸片上启用的内核的芯片(例如,“Extreme”内核数量为Xeon的28个内核)将意味着没有熔断的内核。到目前为止,这可能为我们提供了一些有趣的测试数据,包括最差情况的内核间延迟。

启用了所有内核的内核数较少的模具也可能很有趣。 https://en.wikichip.org/wiki/Category:microprocessor_models_by_intel_based_on_skylake_high_core_count_die页面显示“高”内核数(HCC)SKX内核具有14个内核(ECC内核的一半)。使用该模具的顶级模型是Xeon Gold 5120,即14c / 28t模型。 (每个内核有1个512位FMA单元,而不是2个)。 Intel Ark confirms

如果HCC芯片的每个内核只有1个FMA单元,这与ECC芯片包括5端口512位FMA单元的ECC芯片不同,我不会感到惊讶。这样可以节省英特尔出售的所有中端SKU的芯片面积,而拥有第二个FMA单元仅有助于处理AVX512代码。许多代码没有使用AVX512。 (在这些CPU的端口0 /端口1上,AVX2和AVX512 256位FMA吞吐量仍为2 /时钟。)

关于c - 在两个相同的Skylake Xeon Gold 6154系统上测得的不同的内核间延迟,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57670764/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com