c++ - How does OpenMP use atomic instructions inside the reduction clause?


Reposted by: 行者123. Updated: 2023-12-03 13:16:33


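How does OpenMP use atomic instructions inside the reduction construct?
Does it not rely on atomic instructions at all?
For instance, is the variable sum in the code below accumulated with an atomic '+' operator?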

#include <omp.h>
#include <vector>

using namespace std;
int main()
{
  int m = 1000000; 
  vector<int> v(m);
  for (int i = 0; i < m; i++)
    v[i] = i;

  long long sum = 0;  // long long: the total 499999500000 does not fit in a 32-bit int
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < m; i++)
    sum += v[i];
}

Best answer

How does OpenMP uses atomic instruction inside reduction? Doesn't it rely on atomic at all?

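Since the OpenMP standard does not specify how the reduction clause should (or should not) be implemented (for example, whether or not it is based on atomic operations), the implementation may vary from one concrete implementation of the OpenMP standard to another.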

For instance, is the variable sum in the code below accumulated with atomic + operator?

Nonetheless, the OpenMP standard states the following:

The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.
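
To make the quoted rule concrete, here is a minimal sketch. The function name product_from and the sample values are illustrative assumptions, not part of the question. It uses the * operator, for which each private copy is initialized to 1; the original value of the variable is only folded in when the private copies are combined at the end of the region.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: multiplies `initial` by every element of `v`.
long long product_from(long long initial, const std::vector<int>& v) {
    long long p = initial;                        // the original list item
    const int n = static_cast<int>(v.size());
    // Each private copy of p starts at 1 (the initializer for '*'),
    // not at `initial`; `initial` is combined in once, at the end.
    #pragma omp parallel for reduction(*:p)
    for (int i = 0; i < n; ++i)
        p *= v[i];
    return p;
}

// Usage check with illustrative values: 2 * (1 * 2 * 3 * 4) = 48.
void check_product_from() {
    const std::vector<int> v = {1, 2, 3, 4};
    assert(product_from(2, v) == 48);
}
```

With GCC or Clang this would typically be built with -fopenmp. Without that flag the pragma is ignored and the loop runs serially, producing the same value.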

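Based on that, one can infer that the variables used in the reduction clause are private and, consequently, are not updated atomically. Even if that were not the case, a concrete implementation of the OpenMP standard would be unlikely to rely on an atomic operation for the instruction sum += v[i];, since (in this case) that would not be the most efficient strategy. For more on why that is, see the following SO threads: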

Why my parallel code using openMP atomic takes a longer time than serial code?

Why should I use a reduction rather than an atomic variable?
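
For comparison, the two strategies discussed in those threads can be sketched side by side. The function names below are made up for illustration. Both functions compute the same sum; the difference is that the first performs a synchronized update of the shared variable on every iteration, while the second lets the implementation decide how to combine per-thread results.

```cpp
#include <cassert>
#include <vector>

// Strategy A (illustrative): every iteration updates the shared `sum`
// through an atomic operation, so threads contend on one memory location.
long long sum_with_atomic(const std::vector<int>& v) {
    long long sum = 0;
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        sum += v[i];
    }
    return sum;
}

// Strategy B (illustrative): the reduction clause gives each thread a
// private `sum`; the copies are combined at the end of the region.
long long sum_with_reduction(const std::vector<int>& v) {
    long long sum = 0;
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

// Usage check: both strategies agree on 0 + 1 + ... + 999 = 499500.
void check_strategies_agree() {
    std::vector<int> v(1000);
    for (int i = 0; i < 1000; ++i) v[i] = i;
    assert(sum_with_atomic(v) == 499500);
    assert(sum_with_reduction(v) == 499500);
}
```

The sketch only shows that the results match. How much slower the atomic version is depends on the compiler, the hardware, and the number of threads, so it is best measured rather than assumed.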

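Very informally, a more efficient approach than using atomic is for each thread to have its own copy of the variable sum and, at the end of the parallel region, to save that copy into a resource shared among the threads. Depending on how the reduction is implemented, atomic operations might be used to update that shared resource. The resource is then picked up by the master thread, which reduces its contents and updates the original sum variable accordingly.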
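
A hand-written approximation of that informal description might look as follows. This is a simplified sketch, not what any particular OpenMP runtime actually generates: the name sum_with_private_copies is hypothetical, and here each thread adds its private copy straight into the shared result with one atomic update instead of going through an intermediate shared resource.

```cpp
#include <cassert>
#include <vector>

// Hypothetical manual reduction: one private accumulator per thread,
// one atomic update per thread (not per iteration).
long long sum_with_private_copies(const std::vector<int>& v) {
    long long sum = 0;                         // shared result
    const int n = static_cast<int>(v.size());
    #pragma omp parallel
    {
        long long local_sum = 0;               // this thread's private copy
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            local_sum += v[i];
        #pragma omp atomic
        sum += local_sum;                      // combine once per thread
    }
    return sum;
}

// Usage check: 0 + 1 + ... + 999 = 499500.
void check_private_copies() {
    std::vector<int> v(1000);
    for (int i = 0; i < 1000; ++i) v[i] = i;
    assert(sum_with_private_copies(v) == 499500);
}
```

The number of atomic operations here equals the number of threads, whereas an atomic inside the loop body would execute once per element.
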
More formally, from OpenMP Reductions Under the Hood:

After having revisited parallel reductions in detail you might still have some open questions about how OpenMP actually transforms your sequential code into parallel code. In particular, you might wonder how OpenMP detects the portion in the body of the loop that performs the reduction. As an example, this or a similar code fragment can often be found in code samples:
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < n; i++)
        x -= some_value;
You could also use - as reduction operator (which is actually redundant to +). But how does OpenMP isolate the update step x -= some_value? The discomforting answer is that OpenMP does not detect the update at all! The compiler treats the body of the for-loop like this:
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < n; i++)
        x = some_expression_involving_x_or_not(x);
As a result, the modification of x could also be hidden behind an opaque function call. This is a comprehensible decision from the point of view of a compiler developer. Unfortunately, this means that you have to ensure that all updates of x are compatible with the operation defined in the reduction clause.
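
As a small illustration of an update that is hidden behind a function call yet still compatible with +, consider the sketch below. The helper minus_one and the function subtract_n_times are hypothetical names chosen for this example. Because every private copy starts at 0 and the copies are added to the original value at the end, subtracting 1 a total of n times yields the original value minus n, however the iterations are split among threads.

```cpp
#include <cassert>

// Hypothetical opaque update: the compiler only sees "x = f(x)".
static int minus_one(int x) { return x - 1; }

// Subtracts 1 from x0 a total of n times inside a '+' reduction.
int subtract_n_times(int x0, int n) {
    assert(n >= 0);
    int x = x0;
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < n; ++i)
        x = minus_one(x);      // same effect as x -= 1
    return x;
}
```

For example, subtract_n_times(100, 40) returns 60. An update such as x *= 2 under the same clause would not be compatible, since the combined result would then depend on how the iterations are distributed.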

The overall execution flow of a reduction can be summarized as follows:

1. Spawn a team of threads and determine the set of iterations that each thread j has to perform.

2. Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding monoid.

3. All threads perform their iterations no matter whether or how they involve an update of the privatized variable.

4. The result is computed as sequential reduction over the (local) partial results and the global variable x. Finally, the result is written back to x.
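
The steps above can be imitated by hand for the + operator. The sketch below makes several assumptions: the function name staged_sum is invented, the parameter num_chunks stands in for the team size, and iterations are split into contiguous blocks, which is only one of the ways a runtime might distribute them.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical hand-rolled '+' reduction following the four steps.
// `x` is the original value of the reduction variable.
long long staged_sum(const std::vector<int>& v, long long x, int num_chunks) {
    assert(num_chunks > 0);
    const int n = static_cast<int>(v.size());

    // Step 1: decide which iterations each worker j performs
    // (contiguous blocks of at most `chunk` iterations).
    const int chunk = (n + num_chunks - 1) / num_chunks;

    // Step 2: one privatized variable per worker, initialized with
    // the neutral element of '+', which is 0.
    std::vector<long long> partial(num_chunks, 0);

    // Step 3: each worker runs its own iterations and touches only
    // its own privatized variable.
    #pragma omp parallel for
    for (int j = 0; j < num_chunks; ++j) {
        const int begin = std::min(n, j * chunk);
        const int end = std::min(n, begin + chunk);
        for (int i = begin; i < end; ++i)
            partial[j] += v[i];
    }

    // Step 4: sequential reduction over the partial results and the
    // original value of x; the result is written back to x.
    for (int j = 0; j < num_chunks; ++j)
        x += partial[j];
    return x;
}
```

For instance, with v holding 0 through 9, staged_sum(v, 5, 4) returns 50. Adjacent elements of partial may share a cache line, so a tuned implementation would likely pad them or use thread-local variables; the vector is used here only to keep the steps visible.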

Regarding "c++ - How does OpenMP use atomic instructions inside the reduction clause?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65406478/
