
c - MPI error handler not called when an exception occurs

Reposted. Author: 行者123. Updated: 2023-12-02 01:08:18

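For the past few days I have been trying to write a fault-tolerant application in C using MPI. I am trying to learn how to attach an error handler to the MPI_COMM_WORLD communicator so that, if a node goes down (possibly because of a crash) and exits without calling MPI_Finalize(), the program can still recover from the situation and continue computing.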

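The problem I am currently having is that, after I attach the error handler function to the communicator and then cause a node to crash, MPI does not call the error handler; instead it forces all processes to exit.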

I thought this might be a problem with my application, so I looked for sample code online and tried running it, but the situation is the same... The sample code I am currently trying to run is below. (I got it from here: https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC4QFjAA&url=http%3A%2F%2Fwww.shodor.org%2Fmedia%2Fcontent%2F%2Fpetascale%2Fmaterials%2FdistributedMemory%2Fpresentations%2FMPI_Error_Example.pdf&ei=jq6KUv-BBcO30QW1oYGABg&usg=AFQjCNFa5L_Q6Irg3VrJ3fsQBIyqjBlSgA&sig2=8An4SqBvhCACx5YLwBmROA Apologies that it is a PDF, but I did not write it, so I am pasting the same code below):

/* Template for creating a custom error handler for MPI and a simple program 
to demonstrate its use. How much additional information you can obtain
is determined by the MPI binding in use at build/run time.

To illustrate that the program works correctly use -np 2 through -np 4.

To illustrate an MPI error set victim_mpi = 5 and use -np 6.

To illustrate a system error set victim_os = 5 and use -np 6.

2004-10-10 charliep created
2006-07-15 joshh updated for the MPI2 standard
2007-02-20 mccoyjo adapted for folding@clusters
2010-05-26 charliep cleaned-up/annotated for the petascale workshop
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "mpi.h"

void ccg_mpi_error_handler(MPI_Comm *, int *, ...);

int main(int argc, char *argv[]) {
    MPI_Status status;
    MPI_Errhandler errhandler;
    int number, rank, size, next, from;
    const int tag = 201;
    const int server = 0;
    const int victim_mpi = 5;
    const int victim_os = 6;

    /* Never initialized: sending on it is what provokes the MPI error. */
    MPI_Comm bogus_communicator;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Register the custom handler on MPI_COMM_WORLD. */
    MPI_Comm_create_errhandler(&ccg_mpi_error_handler, &errhandler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errhandler);

    /* Ring neighbours. */
    next = (rank + 1) % size;
    from = (rank + size - 1) % size;

    if (rank == server) {
        printf("Enter the number of times to go around the ring: ");
        fflush(stdout);
        scanf("%d", &number);
        --number;
        printf("Process %d sending %d to %d\n", rank, number, next);
        MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (true) {
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %d\n", rank, number);
        if (rank == server) {
            number--;
            printf("Process 0 decremented number\n");
        }

        if (rank == victim_os) {
            int a[10];
            printf("Process %d about to segfault\n", rank);
            a[15565656] = 56; /* out-of-bounds write to force a system error */
        }

        if (rank == victim_mpi) {
            printf("Process %d about to go south\n", rank);
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, bogus_communicator);
        } else {
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        }

        if (number == 0) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    if (rank == server)
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}

void ccg_mpi_error_handler(MPI_Comm *communicator, int *error_code, ...) {
    char error_string[MPI_MAX_ERROR_STRING];
    int error_string_length;
    printf("ccg_mpi_error_handler: entry\n");
    printf("ccg_mpi_error_handler: error_code = %d\n", *error_code);
    MPI_Error_string(*error_code, error_string, &error_string_length);
    error_string[error_string_length] = '\0';
    printf("ccg_mpi_error_handler: error_string = %s\n", error_string);
    printf("ccg_mpi_error_handler: exit\n");
    exit(1);
}

The program implements a simple token ring, and if I give it the parameters described in the comments, I get the following:

    >>>>>>mpirun -np 6 example.exe
Enter the number of times to go around the ring: 6
Process 1 received 5
Process 1 sending 5 to 2
Process 2 received 5
Process 2 sending 5 to 3
Process 3 received 5
Process 3 sending 5 to 4
Process 4 received 5
Process 4 sending 5 to 5
Process 5 received 5
Process 5 about to go south
Process 5 sending 5 to 0
Process 0 sending 5 to 1
[HP-ENVY-dv6-Notebook-PC:09480] *** Process received signal ***
[HP-ENVY-dv6-Notebook-PC:09480] Signal: Segmentation fault (11)
[HP-ENVY-dv6-Notebook-PC:09480] Signal code: Address not mapped (1)
[HP-ENVY-dv6-Notebook-PC:09480] Failing at address: 0xf0b397
[HP-ENVY-dv6-Notebook-PC:09480] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fc0ec688cb0]
[HP-ENVY-dv6-Notebook-PC:09480] [ 1] /usr/lib/libmpi.so.0(PMPI_Send+0x74) [0x7fc0ec8f3704]
[HP-ENVY-dv6-Notebook-PC:09480] [ 2] example.exe(main+0x23f) [0x400e63]
[HP-ENVY-dv6-Notebook-PC:09480] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fc0ec2da76d]
[HP-ENVY-dv6-Notebook-PC:09480] [ 4] example.exe() [0x400b69]
[HP-ENVY-dv6-Notebook-PC:09480] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 9480 on node andres-HP-ENVY-dv6-Notebook-PC exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

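Clearly, in the output I see, none of the printf() calls in ccg_mpi_error_handler() were executed, so I assume the handler was not called at all. I am not sure whether it helps, but I am running Ubuntu Linux 12.04 and I installed MPI using apt-get. The command I used to compile the program is: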

mpicc err_example.c -o example.exe

Also, when I run mpicc -v, I get the following:

  Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Any help is greatly appreciated! Thanks...

Best answer

The MPI standard does not even require MPI implementations to handle errors gracefully. The following excerpt from §8.3 of MPI-3.0 says it all:

An MPI implementation cannot or may choose not to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations. The set of errors that are handled by MPI is implementation-dependent. Each such error generates an MPI exception.

The above text takes precedence over any text on error handling within this document. Specifically, text that states that errors will be handled should be read as may be handled.

(Original formatting retained, including the use of bold and italics)

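There are many reasons for this, but most of them come down to some trade-off between performance and reliability. Checking for errors at various levels and handling error conditions gracefully incurs a not-so-small overhead and makes the library code base very complex.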
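To make the "exceptions or traps" wording concrete, here is a minimal sketch in plain POSIX C with no MPI involved. The names run_in_child, crash and behave are hypothetical and not part of any library. It shows that an access violation reaches a process as a signal from the operating system rather than as an error code returned by a function, which is one reason a library-level error handler may never get a chance to run. raise(SIGSEGV) stands in for a real wild pointer access so that the behaviour is deterministic.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn() in a child process and return the number of
   the signal that terminated it, 0 if it exited normally, -1 on failure. */
static int run_in_child(void (*fn)(void)) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {            /* child: run the workload, then exit cleanly */
        fn();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for an access violation (assumption: default SIGSEGV disposition). */
static void crash(void) { raise(SIGSEGV); }

/* A workload that does nothing wrong. */
static void behave(void) { }
```

Calling run_in_child(crash) reports SIGSEGV, while run_in_child(behave) reports 0. The parent only learns about the crash from the wait status, much as mpirun reports "exited on signal 11" in the question's log.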

That said, not all MPI libraries are created equal. Some implement better fault tolerance than others. For example, the same code with Intel MPI 4.1:

...
Process 5 about to go south
Process 5 sending 5 to 0
ccg_mpi_error_handler: entry
ccg_mpi_error_handler: error_code = 403287557
ccg_mpi_error_handler: error_string = Invalid communicator, error stack:
MPI_Send(186): MPI_Send(buf=0x7fffa32a7308, count=1, MPI_INT, dest=0, tag=201, comm=0x0) failed
MPI_Send(87).: Invalid communicator
ccg_mpi_error_handler: exit

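The format of the error messages in your case indicates that you are using Open MPI. Fault tolerance in Open MPI is somewhat experimental (one of the OMPI developers, Jeff Squyres, visits Stack Overflow from time to time and could give a more definitive answer) and has to be explicitly enabled at library build time with an option such as --enable-ft=LAM.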

By default, MPICH cannot handle such situations either:

Process 5 about to go south
Process 5 sending 5 to 0

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Note that, currently, MPI does not guarantee that the program state remains consistent when an error is detected:

After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.

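One of the reasons is that it is impossible to perform collective operations on such "broken" communicators, and many internal MPI mechanisms require collective information sharing between all ranks. A much better fault tolerance mechanism called run-through stabilisation (RTS) was proposed for inclusion in MPI-3.0, but it did not make it past the final vote. With RTS, a new MPI call is added that creates a healthy communicator from a broken one by collectively removing all failed processes, after which the remaining processes can continue to operate in the new communicator.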

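Disclaimer: I do not work for Intel and do not endorse their products. It is just that IMPI provides a better out-of-the-box implementation of user error handling than the default build configurations of Open MPI and MPICH. It might be possible to improve on this in both of those implementations by changing the build options, or proper FT might arrive in the future (for example, there is a prototype implementation of RTS in Open MPI).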

Regarding "c - MPI error handler not called when an exception occurs", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20061164/
