c - MPI error handler not called when an exception occurs

Reposted by: 行者123. Last updated: 2023-12-02 01:08:18

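For the past few days I have been trying to write a fault-tolerant application in C using MPI. I am trying to learn how to attach an error handler to the MPI_COMM_WORLD communicator so that, if a node goes down (possibly because of a crash) and exits without calling MPI_Finalize(), the program can still recover and continue the computation.

The problem I have so far is that after I attach the error handler function to the communicator and then cause a node to crash, MPI does not call the error handler but instead forces all the processes to exit.

I thought something might be wrong with my own application, so I looked for sample code online and tried running it, but the result is the same... The sample code I am currently trying to run is below. (I got it from here: https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC4QFjAA&url=http%3A%2F%2Fwww.shodor.org%2Fmedia%2Fcontent%2F%2Fpetascale%2Fmaterials%2FdistributedMemory%2Fpresentations%2FMPI_Error_Example.pdf&ei=jq6KUv-BBcO30QW1oYGABg&usg=AFQjCNFa5L_Q6Irg3VrJ3fsQBIyqjBlSgA&sig2=8An4SqBvhCACx5YLwBmROA Apologies that it is a PDF, but I did not write it, so I am pasting the same code below):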

/* Template for creating a custom error handler for MPI and a simple program 
to demonstrate its use. How much additional information you can obtain 
is determined by the MPI binding in use at build/run time. 

To illustrate that the program works correctly use -np 2 through -np 4.

To illustrate an MPI error set victim_mpi = 5 and use -np 6.

To illustrate a system error set victim_os = 5 and use -np 6.

2004-10-10 charliep created
2006-07-15 joshh  updated for the MPI2 standard
2007-02-20 mccoyjo  adapted for folding@clusters
2010-05-26 charliep cleaned-up/annotated for the petascale workshop 
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "mpi.h"

void ccg_mpi_error_handler(MPI_Comm *, int *, ...);

int main(int argc, char *argv[]) {
    MPI_Status status;
    MPI_Errhandler errhandler;
    int number, rank, size, next, from;
    const int tag = 201;
    const int server = 0;
    const int victim_mpi = 5;
    const int victim_os = 6;

    MPI_Comm bogus_communicator;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm_create_errhandler(&ccg_mpi_error_handler, &errhandler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errhandler);

    next = (rank + 1) % size;
    from = (rank + size - 1) % size;

    if (rank == server) {
        printf("Enter the number of times to go around the ring: ");
        fflush(stdout);
        scanf("%d", &number);                                              
        --number;
        printf("Process %d sending %d to %d\n", rank, number, next);
        MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (true) {
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %d\n", rank, number);
        if (rank == server) {
            number--;
            printf("Process 0 decremented number\n");
        }

        if (rank == victim_os) {
            int a[10];
            printf("Process %d about to segfault\n", rank);
            a[15565656] = 56;
        }

        if (rank == victim_mpi) {
            printf("Process %d about to go south\n", rank);
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, bogus_communicator);
        } else {
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        }

        if (number == 0) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    if (rank == server)
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}

void ccg_mpi_error_handler(MPI_Comm *communicator, int *error_code, ...) {
    char error_string[MPI_MAX_ERROR_STRING];
    int error_string_length;
    printf("ccg_mpi_error_handler: entry\n");
    printf("ccg_mpi_error_handler: error_code = %d\n", *error_code);
    MPI_Error_string(*error_code, error_string, &error_string_length);
    error_string[error_string_length] = '\0';
    printf("ccg_mpi_error_handler: error_string = %s\n", error_string);
    printf("ccg_mpi_error_handler: exit\n");
    exit(1);
}

The program implements a simple token ring. If I give it the parameters described in the comments, I get output like this:

    >>>>>>mpirun -np 6 example.exe
    Enter the number of times to go around the ring: 6
    Process 1 received 5
    Process 1 sending 5 to 2
    Process 2 received 5
    Process 2 sending 5 to 3
    Process 3 received 5
    Process 3 sending 5 to 4
    Process 4 received 5
    Process 4 sending 5 to 5
    Process 5 received 5
    Process 5 about to go south
    Process 5 sending 5 to 0
    Process 0 sending 5 to 1
    [HP-ENVY-dv6-Notebook-PC:09480] *** Process received signal *** 
    [HP-ENVY-dv6-Notebook-PC:09480] Signal: Segmentation fault (11)
    [HP-ENVY-dv6-Notebook-PC:09480] Signal code: Address not mapped (1) 
    [HP-ENVY-dv6-Notebook-PC:09480] Failing at address: 0xf0b397
    [HP-ENVY-dv6-Notebook-PC:09480] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fc0ec688cb0]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 1] /usr/lib/libmpi.so.0(PMPI_Send+0x74) [0x7fc0ec8f3704]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 2] example.exe(main+0x23f) [0x400e63]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fc0ec2da76d]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 4] example.exe() [0x400b69]
    [HP-ENVY-dv6-Notebook-PC:09480] *** End of error message *** 
    --------------------------------------------------------------------------
    mpirun noticed that process rank 5 with PID 9480 on node andres-HP-ENVY-dv6-Notebook-PC exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

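Clearly, none of the printf() calls in ccg_mpi_error_handler() are executed in the output I see, so I assume the handler is not called at all. I am not sure whether it helps, but I am running Ubuntu Linux 12.04 and I installed MPI using apt-get. The command I used to compile the program is: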

mpicc err_example.c -o example.exe

Also, when I run mpicc -v I get the following:

  Using built-in specs.
  COLLECT_GCC=/usr/bin/gcc
  COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
  Target: x86_64-linux-gnu
  Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
  Thread model: posix
  gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Help is greatly appreciated! Thanks...

Best answer

The MPI standard does not even require MPI implementations to be able to handle errors gracefully. The following excerpt from §8.3 of MPI-3.0 says it all:

An MPI implementation cannot or may choose not to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations. The set of errors that are handled by MPI is implementation-dependent. Each such error generates an MPI exception.

The above text takes precedence over any text on error handling within this document. Specifically, text that states that errors will be handled should be read as may be handled.

(The source answer preserves the standard's original formatting, including the use of bold and italics.)

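There are many reasons for this, but most of them come down to a trade-off between performance and reliability. Checking for errors at several levels and handling error conditions gracefully incurs a not-so-small overhead and makes the library code base considerably more complex.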

That said, not all MPI libraries are created equal. Some of them implement better fault tolerance than others. For example, here is the same code with Intel MPI 4.1:

...
Process 5 about to go south
Process 5 sending 5 to 0
ccg_mpi_error_handler: entry
ccg_mpi_error_handler: error_code = 403287557
ccg_mpi_error_handler: error_string = Invalid communicator, error stack:
MPI_Send(186): MPI_Send(buf=0x7fffa32a7308, count=1, MPI_INT, dest=0, tag=201, comm=0x0) failed
MPI_Send(87).: Invalid communicator
ccg_mpi_error_handler: exit

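The format of the error message in your case indicates that you are using Open MPI. Fault tolerance in Open MPI is somewhat experimental (one of the OMPI developers, namely Jeff Squyres, visits Stack Overflow from time to time and could give a more definitive answer) and has to be enabled explicitly at library build time with an option like --enable-ft=LAM.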

By default MPICH cannot handle such situations either:

Process 5 about to go south
Process 5 sending 5 to 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Note that MPI currently does not guarantee that the program state remains consistent once an error has been detected:

After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.

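One of the reasons is that collective operations cannot be performed on such "broken" communicators, and many internal MPI mechanisms require collective information sharing among all ranks. A much better fault tolerance mechanism called run-through stabilization (RTS) was proposed for inclusion in MPI-3.0, but it did not pass the final vote. RTS adds a new MPI call that creates a healthy communicator from a broken one by collectively removing all failed processes; the remaining processes can then continue to operate in the new communicator.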

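Disclaimer: I do not work for Intel and do not endorse their products. It is just that IMPI provides a better out-of-the-box implementation of user error handling than the default build configurations of Open MPI and MPICH. Something comparable might be possible in those libraries by changing the build options, or proper FT may arrive in the future (for example, there is a prototype implementation of RTS in Open MPI).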

The original question, "c - MPI error handler not called when an exception occurs", is on Stack Overflow: https://stackoverflow.com/questions/20061164/
