c++ - Fatal error in MPI_Allreduce

Reposted. Author: 搜寻专家. Updated: 2023-10-31 01:00:51
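I need to set up a cluster with MPICH. To start, I tried these examples ( http://mpitutorial.com/beginner-mpi-tutorial/ ) on a single machine, and they worked as expected. I then built the cluster following this guide (https://help.ubuntu.com/community/MpichCluster) and ran the example below, which also worked.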

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    /* Start MPI, then query the total process count and this process's rank. */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}

mpiexec -n 8 -f machinefile ./mpi_hello

Next I ran this example (http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/), but this time I got the error below. I can't tell what went wrong.

    Fatal error in MPI_Allreduce: A process has failed, error stack:
MPI_Allreduce(861)........: MPI_Allreduce(sbuf=0x7ffff0f55630, rbuf=0x7ffff0f55634, count=1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(719)..:
MPIR_Allreduce_intra(362).:
dequeue_and_set_error(888): Communication error with rank 1

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@ce-412] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@ce-412] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@ce-412] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ce-411] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@ce-411] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@ce-411] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@ce-411] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Best answer

Yes, as @Alexey mentioned, this was purely a network error. Here is what I did to get it working.

1) Export the host file as HYDRA_HOST_FILE so that MPICH knows about it (more information: https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager)

    export HYDRA_HOST_FILE=<path_to_host_file>/hosts

2) I had to work around this issue (http://lists.mpich.org/pipermail/discuss/2013-January/000285.html) with the following option:

   -disable-hostname-propagation

Finally, this is the command that gave me proper connectivity between the cluster nodes.

  mpiexec -launcher fork -disable-hostname-propagation  -f machinefile -np 4 ./Test

Regarding c++ - fatal error in MPI_Allreduce, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30205551/