
Simple Program that Implements a Multi-thread Example from a Tutorial Runs Slower than Single Threaded Implementation

Repost · Author: bug小助手 · Updated: 2023-10-26 21:09:18



I tried following the first example from the tutorial here -> YouTube Link.



The code for the example is:




#include <iostream>
#include <thread>
#include <chrono>
#include <algorithm>

unsigned long long OddSum = 0;
unsigned long long EvenSum = 0;

void findEven(unsigned long long start, unsigned long long end) {
    for (unsigned long long i = start; i <= end; i++) {
        if ((i & 1) == 0) {
            EvenSum += i;
        }
    }
}

void findOdd(unsigned long long start, unsigned long long end) {
    for (unsigned long long i = start; i <= end; i++) {
        if ((i & 1) == 1) {
            OddSum += i;
        }
    }
}

int main() {
    unsigned long long start = 0, end = 1900000000;

    auto startTime = std::chrono::high_resolution_clock::now();

    std::thread t1(findEven, start, end);
    std::thread t2(findOdd, start, end);

    t1.join();
    t2.join();

    // findOdd(start, end);
    // findEven(start, end);

    auto stopTime = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stopTime - startTime);

    std::cout << "OddSum : " << OddSum << std::endl;
    std::cout << "EvenSum : " << EvenSum << std::endl;

    std::cout << "Sec : " << duration.count() / 1000000 << std::endl;

    return 0;
}

When I run the single thread version it takes 3 seconds in total but the multithread version takes 18 seconds. I'm not sure what's wrong. Other posts have mentioned that threads are useless on a single core machine but I'm running on a multi-core CPU. Is there something obvious I'm overlooking? Thank you!



Tried the solution as shown in the tutorial and was expecting the multi-threaded version to be faster than the single thread implementation. The actual outcome was that the multi-threaded version took longer.



More replies

Are you compiling with optimizations enabled? If not, then any performance measurements you take are essentially invalid.


Threads are not free to use; they have overhead, and they take time to create and start. Also, this code doesn't even use threads to run faster; it just shows you how to start and join them. Actual multithreaded code would split the loops into smaller parts and assign those parts to threads.


@MilesBudnek That was it. I turned off debug mode and it ran in half the time.


Thank you @PepijnKramer too for your input.


Recommended answer

Problem: Cache coherence


The reason the threaded program runs more slowly than the
sequential program is the cost of the
cache coherence
protocol.



In a multi-core CPU, each core has its own
Level 1 (L1) Cache, which is
fast storage that the core attempts to use to satisfy memory reads and
writes. Memory accesses that miss the L1 cache have to go to a slower
resource such as L2 or L3, or even main memory, although on a typical
multi-core CPU, at least
L3 will be shared across cores.



In the program in the question, one thread is accessing each of these
global variables:



unsigned long long OddSum = 0;
unsigned long long EvenSum = 0;

A typical C++ compiler will put these two variables next to each other
in memory. That means they are likely to share a
cache line,
meaning that if one variable is loaded into the L1 cache of one core,
then both are in that same core's L1. But a given cache line cannot be
in both cores' caches at the same time, since then the computation would
become inconsistent due to having two copies of each variable. (Note
that the precise notion of consistency depends on the
memory ordering
guarantees provided by the processor architecture.)



When thread 1 (T1) starts, it pulls the cache line containing both
variables into its cache. When thread 2 (T2) begins executing, it needs
them, so the cache coherence protocol engages, forcibly evicting the
data from the T1 cache. Approximately one nanosecond later, T1 once
again needs the data, so again coherence kicks in, and the data is
evicted from T2's cache. The two cores fight over the cache line
containing the data elements continuously for the entire computation,
and all that fighting is potentially much slower than letting one core
do all of the work.



On my system (cygwin g++ 11.4.0 on Windows 10, 4-core i5-6600K CPU @
3.50GHz, median of 5 measurements, +/- is interquartile), with no
compiler optimizations, I measure:




  • Original program, single thread: 6489 ms +/- 28 ms

  • Original program, two threads: 6836 ms +/- 694 ms


The threaded version is not that much slower, but it's definitely not
faster, and the high dispersion is also a consequence of the cache
coherence mechanism being unhappy.



Solution: Separating the data


We can fix the coherence problem by moving the two variables far enough
apart that they do not share a cache line. Then the variables needed
by each core can live in their respective L1s without interference.



To separate the variables, as a proof of concept, it is often sufficient
to declare another variable in between:



unsigned long long OddSum = 0;
char separation[32]; // <---- added
unsigned long long EvenSum = 0;

To confirm this solves the problem, here are the measurements:




  • 32 bytes separation, single thread: 6455 ms +/- 28 ms

  • 32 bytes separation, two threads: 3276 ms +/- 22 ms


On my system, less than 32 bytes of separation is not enough. By
printing out the addresses of OddSum and EvenSum, I can see that 32
bytes of separation causes the variables to be placed by the compiler 64
bytes apart. Since (in my case) OddSum is page-aligned, that means my
cache line size is 64 bytes. (Under Linux, that value is also visible
in /proc/cpuinfo.)



Red herring: Enabling compiler optimizations


Now let's return to the original program and add compiler optimizations:




  • Original program, single thread, -O2: 2164 ms +/- 26 ms

  • Original program, two threads, -O2: 1340 ms +/- 60 ms


It looks like this solved the problem, right? But what did the
optimizer actually do? Let's look at the output of
objdump --disassemble applied to the object file, specifically
the findEven function:



0000000000000000 <_Z8findEvenyy>:
0: 48 39 d1 cmp %rdx,%rcx
3: 77 2e ja 33 <_Z8findEvenyy+0x33>
5: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # c <_Z8findEvenyy+0xc>
c: 45 31 c0 xor %r8d,%r8d
f: 90 nop
10: f6 c1 01 test $0x1,%cl
13: 75 09 jne 1e <_Z8findEvenyy+0x1e>
15: 48 01 c8 add %rcx,%rax
18: 41 b8 01 00 00 00 mov $0x1,%r8d
1e: 48 83 c1 01 add $0x1,%rcx
22: 48 39 ca cmp %rcx,%rdx
25: 73 e9 jae 10 <_Z8findEvenyy+0x10>
27: 45 84 c0 test %r8b,%r8b
2a: 74 07 je 33 <_Z8findEvenyy+0x33>
2c: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 33 <_Z8findEvenyy+0x33>
33: c3 ret
34: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
3b: 00 00 00 00
3f: 90 nop

The loop goes from address 10 (the first test) to address 25
(jae). The key observation is there are no memory accesses in the
loop body! The value of EvenSum is read by the instruction at 5,
then written at 2c, but only if the if statement was executed
(instruction 18 sets a flag that is checked by instruction 27).



In contrast, this is the disassembly without optimization:



0000000000000000 <_Z8findEvenyy>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 48 89 4d 10 mov %rcx,0x10(%rbp)
c: 48 89 55 18 mov %rdx,0x18(%rbp)
10: 48 8b 45 10 mov 0x10(%rbp),%rax
14: 48 89 45 f8 mov %rax,-0x8(%rbp)
18: eb 26 jmp 40 <_Z8findEvenyy+0x40>
1a: 48 8b 45 f8 mov -0x8(%rbp),%rax
1e: 83 e0 01 and $0x1,%eax
21: 48 85 c0 test %rax,%rax
24: 75 15 jne 3b <_Z8findEvenyy+0x3b>
26: 48 8b 15 08 00 00 00 mov 0x8(%rip),%rdx # 35 <_Z8findEvenyy+0x35>
2d: 48 8b 45 f8 mov -0x8(%rbp),%rax
31: 48 01 d0 add %rdx,%rax
34: 48 89 05 08 00 00 00 mov %rax,0x8(%rip) # 43 <_Z8findEvenyy+0x43>
3b: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
40: 48 8b 45 f8 mov -0x8(%rbp),%rax
44: 48 3b 45 18 cmp 0x18(%rbp),%rax
48: 76 d0 jbe 1a <_Z8findEvenyy+0x1a>
4a: 90 nop
4b: 90 nop
4c: 48 83 c4 10 add $0x10,%rsp
50: 5d pop %rbp
51: c3 ret

This time the loop goes from 1a to 48. Instruction 26 reads the
value of EvenSum from memory into a register, and instruction 34
writes the result of the addition back to EvenSum in memory. This is
the memory traffic that collides with the traffic from the other thread.



So why do I call enabling optimization a red herring? Because it is only a
temporary fix for this particular definition of findEven. The function is
currently so simple that the compiler can see it is able to avoid all memory
traffic in the loop, so it hoists the accumulation into a register. But as
soon as you do something more realistically complicated, the memory accesses
in the loop will return, and performance will take a big hit once again.



Concluding thoughts


Diagnosing and fixing performance problems in multithreaded code is
difficult. I just guessed what the problem was and confirmed it with
the disassembly and experimentation. The "solution" I've proposed above
is obviously fragile: if my CPU had a cache line of 128 bytes instead of
64 bytes then more separation would be needed. And if the compiler
decided to put the separation variable elsewhere, or discard it
entirely, then that too would break things.



To reliably detect when caching issues are the cause of slow performance,
you can use a profiler that can use
hardware performance counters.
That would show the high coherence eviction count associated with the
memory instructions in this example. But such profilers are, in my
experience, fairly difficult to use (which is why I didn't in this
case).



To reliably avoid coherence problems in particular, try to design your
program in such a way that you can put all the data used by each thread
physically together, ideally on separate pages of memory (a page is
typically 4 KiB, or 16 KiB on some systems) from that used by other
threads. (Low-level allocation routines such as mmap or VirtualAlloc can
allocate at the page granularity.)



