
Simple Program that Implements a Multi-thread Example from a Tutorial Runs Slower than Single Threaded Implementation

Repost · Author: bug小助手 · Updated: 2023-10-26 21:09:18



I tried following the first example from the tutorial here -> YouTube Link.



The code for the example is:




#include <iostream>
#include <thread>
#include <chrono>
#include <algorithm>

unsigned long long OddSum = 0;
unsigned long long EvenSum = 0;

void findEven(unsigned long long start, unsigned long long end) {
    for (unsigned long long i = start; i <= end; i++) {
        if ((i & 1) == 0) {
            EvenSum += i;
        }
    }
}

void findOdd(unsigned long long start, unsigned long long end) {
    for (unsigned long long i = start; i <= end; i++) {
        if ((i & 1) == 1) {
            OddSum += i;
        }
    }
}

int main() {
    unsigned long long start = 0, end = 1900000000;

    auto startTime = std::chrono::high_resolution_clock::now();

    std::thread t1(findEven, start, end);
    std::thread t2(findOdd, start, end);

    t1.join();
    t2.join();

    // findOdd(start, end);
    // findEven(start, end);

    auto stopTime = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stopTime - startTime);

    std::cout << "OddSum : " << OddSum << std::endl;
    std::cout << "EvenSum : " << EvenSum << std::endl;

    std::cout << "Sec : " << duration.count() / 1000000 << std::endl;

    return 0;
}

When I run the single thread version it takes 3 seconds in total but the multithread version takes 18 seconds. I'm not sure what's wrong. Other posts have mentioned that threads are useless on a single core machine but I'm running on a multi-core CPU. Is there something obvious I'm overlooking? Thank you!



Tried the solution as shown in the tutorial and was expecting the multi-threaded version to be faster than the single thread implementation. The actual outcome was that the multi-threaded version took longer.



More replies

Are you compiling with optimizations enabled? If not, then any performance measurements you take are essentially invalid.


Threads are not free to use; they have overhead, and they take time to create and start. Also, this code doesn't even use threads to run faster; it just shows you how to start and join them. Actual multithreaded code would split the loops into smaller parts and assign those parts to threads.


@MilesBudnek That was it. I turned off debug mode and it ran in half the time.


Thank you @PepijnKramer too for your input.


Recommended answer

Problem: Cache coherence


The reason the threaded program runs more slowly than the
sequential program is the cost of the
cache coherence
protocol.



In a multi-core CPU, each core has its own
Level 1 (L1) Cache, which is
fast storage that the core attempts to use to satisfy memory reads and
writes. Memory accesses that miss the L1 cache have to go to a slower
resource such as L2 or L3, or even main memory, although on a typical
multi-core CPU, at least
L3 will be shared across cores.



In the program in the question, one thread is accessing each of these
global variables:



unsigned long long OddSum = 0;
unsigned long long EvenSum = 0;

A typical C++ compiler will put these two variables next to each other
in memory. That means they are likely to share a
cache line,
meaning that if one variable is loaded into the L1 cache of one core,
then both are in that same core's L1. But a given cache line cannot be
in both cores' caches at the same time, since then the computation would
become inconsistent due to having two copies of each variable. (Note
that the precise notion of consistency depends on the
memory ordering
guarantees provided by the processor architecture.)



When thread 1 (T1) starts, it pulls the cache line containing both
variables into its cache. When thread 2 (T2) begins executing, it needs
them, so the cache coherence protocol engages, forcibly evicting the
data from the T1 cache. Approximately one nanosecond later, T1 once
again needs the data, so again coherence kicks in, and the data is
evicted from T2's cache. The two cores fight over the cache line
containing the data elements continuously for the entire computation,
and all that fighting is potentially much slower than letting one core
do all of the work.



On my system (cygwin g++ 11.4.0 on Windows 10, 4-core i5-6600K CPU @
3.50GHz, median of 5 measurements, +/- is interquartile), with no
compiler optimizations, I measure:




  • Original program, single thread: 6489 ms +/- 28 ms

  • Original program, two threads: 6836 ms +/- 694 ms


The threaded version is not that much slower, but it's definitely not
faster, and the high dispersion is also a consequence of the cache
coherence mechanism being unhappy.



Solution: Separating the data


We can fix the coherence problem by moving the two variables far enough
apart that they do not share a cache line. Then the variables needed
by each core can live in their respective L1s without interference.



To separate the variables, as a proof of concept, it is often sufficient
to declare another variable in between:



unsigned long long OddSum = 0;
char separation[32]; // <---- added
unsigned long long EvenSum = 0;

To confirm this solves the problem, here are the measurements:




  • 32 bytes separation, single thread: 6455 ms +/- 28 ms

  • 32 bytes separation, two threads: 3276 ms +/- 22 ms


On my system, less than 32 bytes of separation is not enough. By
printing out the addresses of OddSum and EvenSum, I can see that 32
bytes of separation causes the variables to be placed by the compiler 64
bytes apart. Since (in my case) OddSum is page-aligned, that means my
cache line size is 64 bytes. (Under Linux, that value is also visible
in /proc/cpuinfo.)



Red herring: Enabling compiler optimizations


Now let's return to the original program and add compiler optimizations:




  • Original program, single thread, -O2: 2164 ms +/- 26 ms

  • Original program, two threads, -O2: 1340 ms +/- 60 ms


It looks like this solved the problem, right? But what did the
optimizer actually do? Let's look at the output of
objdump --disassemble applied to the object file, specifically
the findEven function:



0000000000000000 <_Z8findEvenyy>:
0: 48 39 d1 cmp %rdx,%rcx
3: 77 2e ja 33 <_Z8findEvenyy+0x33>
5: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # c <_Z8findEvenyy+0xc>
c: 45 31 c0 xor %r8d,%r8d
f: 90 nop
10: f6 c1 01 test $0x1,%cl
13: 75 09 jne 1e <_Z8findEvenyy+0x1e>
15: 48 01 c8 add %rcx,%rax
18: 41 b8 01 00 00 00 mov $0x1,%r8d
1e: 48 83 c1 01 add $0x1,%rcx
22: 48 39 ca cmp %rcx,%rdx
25: 73 e9 jae 10 <_Z8findEvenyy+0x10>
27: 45 84 c0 test %r8b,%r8b
2a: 74 07 je 33 <_Z8findEvenyy+0x33>
2c: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 33 <_Z8findEvenyy+0x33>
33: c3 ret
34: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
3b: 00 00 00 00
3f: 90 nop

The loop goes from address 10 (the first test) to address 25
(jae). The key observation is there are no memory accesses in the
loop body! The value of EvenSum is read by the instruction at 5,
then written at 2c, but only if the if statement was executed
(instruction 18 sets a flag that is checked by instruction 27).



In contrast, this is the disassembly without optimization:



0000000000000000 <_Z8findEvenyy>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 48 89 4d 10 mov %rcx,0x10(%rbp)
c: 48 89 55 18 mov %rdx,0x18(%rbp)
10: 48 8b 45 10 mov 0x10(%rbp),%rax
14: 48 89 45 f8 mov %rax,-0x8(%rbp)
18: eb 26 jmp 40 <_Z8findEvenyy+0x40>
1a: 48 8b 45 f8 mov -0x8(%rbp),%rax
1e: 83 e0 01 and $0x1,%eax
21: 48 85 c0 test %rax,%rax
24: 75 15 jne 3b <_Z8findEvenyy+0x3b>
26: 48 8b 15 08 00 00 00 mov 0x8(%rip),%rdx # 35 <_Z8findEvenyy+0x35>
2d: 48 8b 45 f8 mov -0x8(%rbp),%rax
31: 48 01 d0 add %rdx,%rax
34: 48 89 05 08 00 00 00 mov %rax,0x8(%rip) # 43 <_Z8findEvenyy+0x43>
3b: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
40: 48 8b 45 f8 mov -0x8(%rbp),%rax
44: 48 3b 45 18 cmp 0x18(%rbp),%rax
48: 76 d0 jbe 1a <_Z8findEvenyy+0x1a>
4a: 90 nop
4b: 90 nop
4c: 48 83 c4 10 add $0x10,%rsp
50: 5d pop %rbp
51: c3 ret

This time the loop goes from 1a to 48. Instruction 26 reads the
value of EvenSum from memory into a register, and instruction 34
writes the result of the addition back to EvenSum in memory. This is
the memory traffic that collides with the traffic from the other thread.



So why do I call enabling optimization a red herring? Because it is only a
temporary fix for this particular definition of findEven. The function is
currently so simple that the compiler can see it is able to avoid all memory
traffic in the loop, so it hoists the accumulation into a register. But as
soon as you do something more realistically complicated, the memory accesses
in the loop will return, and performance will take a big hit once again.



Concluding thoughts


Diagnosing and fixing performance problems in multithreaded code is
difficult. I just guessed what the problem was and confirmed it with
the disassembly and experimentation. The "solution" I've proposed above
is obviously fragile: if my CPU had a cache line of 128 bytes instead of
64 bytes then more separation would be needed. And if the compiler
decided to put the separation variable elsewhere, or discard it
entirely, then that too would break things.



To reliably detect when caching issues are the cause of slow performance,
you can use a profiler that can use
hardware performance counters.
That would show the high coherence eviction count associated with the
memory instructions in this example. But such profilers are, in my
experience, fairly difficult to use (which is why I didn't in this
case).



To reliably avoid coherence problems in particular, try to design your
program in such a way that you can put all the data used by each thread
physically together, ideally on separate pages of memory (a page is
typically 4 KiB, or 16 KiB on some systems) from that used by other
threads. (Low-level allocation routines such as mmap or VirtualAlloc can
allocate at the page granularity.)



