
c++ - Huge latency spikes while running simple code


I have a simple benchmark that demonstrates the performance of a busy-wait thread. It runs in two modes: the first simply takes two time points one after the other; the second iterates over a vector and measures the duration of each iteration. I see that two consecutive calls to clock::now() take about 50 nanoseconds on average, and one average iteration over the vector takes about 100 nanoseconds. But sometimes these operations execute with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second.

The test runs on a single isolated core, so no context switches occur. I also call mlockall at the beginning of the program, so I assume page faults do not affect the performance.

The following additional optimizations were also applied (a runtime check for the isolation settings is sketched after this list):

  • Kernel boot parameters: intel_idle.max_cstate=0 idle=halt irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0 selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20 nohz_full=19-20;
  • rcu[^c] kernel threads moved to the housekeeping CPU core 0;
  • NIC Rx/Tx queues moved to the housekeeping CPU core 0;
  • writeback kernel workqueue moved to the housekeeping CPU core 0;
  • transparent_hugepage disabled;
  • Intel CPU hyper-threading disabled;
  • no swap file/partition is used.
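
As referenced above, it is worth verifying at runtime that isolcpus= and nohz_full= actually took effect. A minimal check (my addition, not part of the question; it assumes the standard procfs/sysfs paths available on this 5.1 kernel):

#include <fstream>
#include <iostream>
#include <string>

// Dump the kernel command line and the isolated-CPU list so the
// isolcpus=/nohz_full= settings listed above can be confirmed.
int main()
{
    std::string line;

    std::ifstream cmdline("/proc/cmdline");
    std::getline(cmdline, line);
    std::cout << "cmdline: " << line << '\n';

    std::ifstream isolated("/sys/devices/system/cpu/isolated");
    std::getline(isolated, line);
    std::cout << "isolated CPUs: " << line << '\n'; // expected: 4-13,16-27
}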

Environment:

System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux

that has the following PREEMPT and HZ settings:
CONFIG_HZ_300=y
CONFIG_HZ=300
CONFIG_PREEMPT=y

Hardware details:

RAM: 256GB

CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 3200.011
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.68
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27

Sample code (the helpers Measurings, SetCpuAffinity and WaitForKey are not shown in the question; a sketch of possible implementations follows the listing):


#include <sys/mman.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <stdexcept>
#include <vector>

struct TData
{
    std::vector<char> Data;

    TData() = default;
    TData(size_t aSize)
    {
        for (size_t i = 0; i < aSize; ++i)
        {
            Data.push_back(static_cast<char>(i));
        }
    }
};

using TBuffer = std::vector<TData>;

// Returns a copy of the current buffer element and advances the index,
// wrapping around at the end of the buffer.
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
    if (!aPerform)
    {
        return TData {};
    }

    const TData& result = aBuffer[outBufferIndex];

    if (++outBufferIndex == aBuffer.size())
    {
        outBufferIndex = 0;
    }

    return result;
}

void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
{
    size_t bufferIndex = 0;
    for (size_t i = 0; i < aCyclesCount; ++i)
    {
        auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
    }
}

// Times each iteration with two steady_clock::now() calls and records
// the difference in nanoseconds.
void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
{
    size_t bufferIndex = 0;
    for (size_t i = 0; i < aCyclesCount; ++i)
    {
        auto t1 = std::chrono::steady_clock::now();
        {
            auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
        }
        auto t2 = std::chrono::steady_clock::now();
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        outStatistics.AddMeasuring(diff, t2);
    }
}

int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
{
    // Lock all current and future pages into RAM so page faults cannot
    // stall the measurement loop.
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
    {
        throw std::runtime_error("mlockall failed");
    }

    std::cout << "Test parameters"
              << ":\ndata size=" << aDataSize
              << ",\nnumber of elements=" << aBufferSize
              << ",\nbuffer size=" << aBufferSize * aDataSize
              << ",\nnumber of cycles=" << aCyclesCount
              << ",\nallocate=" << aAllocate
              << ",\nperform=" << aPerform
              << ",\nthread ";

    SetCpuAffinity(aCpu);

    TBuffer buffer;

    if (aPerform)
    {
        buffer.resize(aBufferSize);
        std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
    }

    WaitForKey();
    std::cout << "Running..." << std::endl;

    WarmUp(aBufferSize * 2, aPerform, buffer);

    Measurings statistics;
    TestCycle(aCyclesCount, aPerform, buffer, statistics);
    statistics.Print(aCyclesCount);

    WaitForKey();

    if (munlockall())
    {
        throw std::runtime_error("munlockall failed");
    }

    return 0;
}
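
The helpers SetCpuAffinity, WaitForKey and Measurings are declared elsewhere in the question's test harness. A minimal sketch of what they might look like (the implementations below are my assumptions, reconstructed from how they are called and from the output format, not the author's actual code):

#include <pthread.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <chrono>
#include <iostream>
#include <stdexcept>

// Pin the calling thread to a single CPU so the scheduler cannot migrate it.
// Completes the "thread <tid> on cpu <n>" line started by Run().
void SetCpuAffinity(int aCpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(aCpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
    {
        throw std::runtime_error("pthread_setaffinity_np failed");
    }
    std::cout << syscall(SYS_gettid) << " on cpu " << aCpu << std::endl;
}

void WaitForKey()
{
    std::cin.get();
}

// Accumulates per-iteration latencies. The question's output additionally
// buckets them into a logarithmic histogram, which is omitted here.
struct Measurings
{
    long long Min = -1, Max = 0, Sum = 0;

    void AddMeasuring(long long aDiff, std::chrono::steady_clock::time_point)
    {
        if (Min < 0 || aDiff < Min) Min = aDiff;
        if (aDiff > Max) Max = aDiff;
        Sum += aDiff;
    }

    void Print(size_t aCyclesCount) const
    {
        std::cout << "Statistics: min: " << Min << ": max: " << Max
                  << ": avg: " << Sum / static_cast<long long>(aCyclesCount)
                  << std::endl;
    }
};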

The following results were received. First:

StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=0,
thread 14056 on cpu 19

Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650

Second:

StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=1,
thread 3264 on cpu 19

Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867

Any ideas what causes such huge delays and how to investigate it?

Best Answer

在:

TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);

it returns TData, which contains a std::vector<char>, by value. This involves a memory allocation and a data copy. The memory allocation can perform a system call (brk or mmap), and memory-mapping-related syscalls are notorious for being slow.

When your timings include system calls, you cannot expect low variance.
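
One common mitigation for that variance (my suggestion, not part of the original answer; it relies on glibc's documented mallopt tunables) is to stop the allocator from talking to the kernel after start-up, so that a warmed-up heap serves all further requests from user space:

#include <malloc.h>

// Configure glibc's allocator so that, after warm-up, allocations never
// issue brk/mmap/munmap syscalls inside the timed region:
//  - M_MMAP_MAX = 0 disables mmap-backed allocations entirely;
//  - M_TRIM_THRESHOLD = -1 stops freed memory from being returned to the OS.
void PinHeap()
{
    mallopt(M_MMAP_MAX, 0);
    mallopt(M_TRIM_THRESHOLD, -1);
}

Called once before mlockall and the warm-up pass, this keeps later allocations inside pages that are already mapped and locked.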

You may like to run your application with /usr/bin/time --verbose <app> or perf stat -ddd <app> to see the number of page faults and context switches.
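
Following that diagnosis, an easy experiment is to remove the copy from the hot path entirely by returning a reference, so the timed region performs no allocation at all. A sketch of such a variant (my rewrite of the question's function, not the author's code):

// Returning a const reference avoids copying the std::vector<char>,
// and therefore avoids any heap allocation inside the timed region.
// A static empty object backs the !aPerform branch.
const TData& DoMemoryOperationNoCopy(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
    static const TData empty {};

    if (!aPerform)
    {
        return empty;
    }

    const TData& result = aBuffer[outBufferIndex];

    if (++outBufferIndex == aBuffer.size())
    {
        outBufferIndex = 0;
    }

    return result;
}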

Regarding c++ - Huge latency spikes while running simple code, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56871923/
