c - mpi_comm_spawn error: MPI Application rank 0 killed before MPI_Finalize() with signal 11

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:58:30
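I am trying to run an MPI program using MPI_Comm_spawn. I spawn 1 worker and then call MPI_Reduce in both programs to add up some results. For some reason, the application hangs at MPI_Comm_spawn and then aborts about a minute later. After that, the spawned process only gets as far as its call to MPI_Reduce. The application then keeps hanging and prints more errors to the command prompt. What should happen is that both the spawned program and the main program reach the MPI_Reduce call, and the main program gets the sum and prints it.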

Here is the output. I have put a <> in front of the lines that come from MPI rather than from my own program:

world size = 1   
About to call MPI_Comm_spawn with 2 workers...
parent result is 3.141668952
numDarts for child: 500000000
argv[1] = 500000000
<>MPI Application rank 0 killed before MPI_Finalize() with signal 11
spawned process got result: 3.141668952
Spawned process about to send message back to parent
<>piworker: Rank 1:0: MPI_Finalize: IBV connection to 0 on card 0 is broken
<>piworker: Rank 1:0: MPI_Finalize: ibv_poll_cq(): bad status 12
<>piworker: Rank 1:0: MPI_Finalize: self n93 peer n93 (rank: 0)
<>piworker: Rank 1:0: MPI_Finalize: error message: transport retry exceeded error

Here is the code for the main program:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "globals.h"


int randSign();
double randFloat();
double dboard();


int main(int argc, char *argv[])
{
    int world_size, flag;
    MPI_Comm everyone; /* intercommunicator */
    char worker_program[100];
    int universe_size;

    // MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe_size, &flag);
    // printf("universe size: %i\n", universe_size);

    int numDarts = 1000000000;
    int numWorkers = 2;

    char* args[1];
    if(argc >= 2)
    {
        numWorkers = atoi(argv[1]);
    }
    if(argc >= 3)
        numDarts = atoi(argv[2]);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    printf("world size = %i\n", world_size);
    if (world_size != 1)
        printf("Top heavy with management\n");

    int numDartsWorker = numDarts/numWorkers;
    int numDartsMaster = numDarts/numWorkers + (numDarts % numWorkers); //the master computes the leftover
    args[0] = malloc(256 * sizeof(char));
    sprintf(args[0], "%i", numDartsWorker);
    printf("argument passing to workers: %s\n", args[0]);
    /*
     * Now spawn the workers. Note that there is a run-time determination
     * of what type of worker to spawn, and presumably this calculation must
     * be done at run time and cannot be calculated before starting
     * the program. If everything is known when the application is
     * first started, it is generally better to start them all at once
     * in a single MPI_COMM_WORLD.
     */
    printf("About to call MPI_Comm_spawn with %i workers...\n", numWorkers);
    int resultLen = 0;

    double myresult = dboard(numDartsMaster);
    printf("parent result is %.9f\n", myresult);


    //the master counts as a worker, hence the -1
    MPI_Comm_spawn("piworker", args, numWorkers-1, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                   &everyone, MPI_ERRCODES_IGNORE);

    double pisum = 24;
    int rc = MPI_Reduce(&myresult, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, everyone);

    if (rc != MPI_SUCCESS)
        printf("failure on mpi_reduce\n");

    free(args);
    /*
     * Parallel code here. The communicator "everyone" can be used
     * to communicate with the spawned processes, which have ranks 0,..
     * MPI_UNIVERSE_SIZE-1 in the remote group of the intercommunicator
     * "everyone".
     */

    //receive the results
    int i=1;
    MPI_Status status;
    double avgpi = pisum/(double)numWorkers;
    printf("With %i workers, %i darts, estimated value of pi is: %.9f\n", numWorkers, numDarts, avgpi);

    MPI_Finalize();
    return 0;
}

Code for the worker (spawned) program:

int main(int argc, char *argv[])
{
    int size;
    MPI_Comm parent;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL)
        printf("No parent!");
    int taskid;
    MPI_Comm_remote_size(parent, &size);
    MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
    double pisum = 0;
    int resultLen = 0;
    char parentName[256];
    int numDarts;


    if (size != 1)
    {
        printf("Something's wrong with the parent");
        return 1;
    }
    /*
     * Parallel code here.
     * The manager is represented as the process with rank 0 in (the remote
     * group of) the parent communicator. If the workers need to communicate
     * among themselves, they can use MPI_COMM_WORLD.
     */
    if(argc >= 2)
        numDarts = atoi(argv[1]);
    else
    {
        printf("Error for: %i, number of darts not specified.\n", taskid);
    }
    printf("numDarts for child: %i\n", numDarts);
    printf("argv[1] = %s\n", argv[1]);
    double myPiSum = dboard(numDarts);
    printf("spawned process got result: %.9f\n", myPiSum);
    printf("Spawned process about to send message back to parent\n");
    //MPI_Send((void *)&myPiSum, 1, MPI_DOUBLE, 0, 1, parent);

    int rc = MPI_Reduce(&myPiSum, &pisum, 1, MPI_DOUBLE, MPI_SUM, 0, parent);
    if(rc != MPI_SUCCESS)
        printf("%d: Problem with mpi_reduce\n");


    printf("Sent message back to parent");
    MPI_Finalize();
    return 0;
}

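Hopefully someone with more experience with this can shed more light on the cause. I have been trying all sorts of things, which is why there are so many printf calls.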

Best answer

The problem is that the master process dies because of an incorrect use of free():

char* args[1];
...
args[0] = malloc(256 * sizeof(char));
...
free(args);

You are trying to free non-heap (stack) memory, and free(args) triggers an abort in modern glibc versions. The correct call is:

free(args[0]);
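
As a minimal sketch of the same fix, the argument handling can be split into two small helpers. The names fill_spawn_args and release_spawn_args are hypothetical and do not appear in the question. The sketch also sets a trailing NULL entry, because MPI_Comm_spawn expects argv to be NULL-terminated in C; the original answer does not mention this, and the question's one-element array has no room for it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: fill a caller-owned (e.g. stack) array with the
 * single worker argument followed by the NULL terminator.
 * Returns 0 on success, -1 if the allocation fails. */
int fill_spawn_args(char *args[2], int num_darts)
{
    enum { ARG_LEN = 32 };      /* plenty for any int printed in decimal */

    args[1] = NULL;             /* terminator for the argument vector */
    args[0] = malloc(ARG_LEN);  /* only this string lives on the heap */
    if (args[0] == NULL)
        return -1;

    int written = snprintf(args[0], ARG_LEN, "%d", num_darts);
    assert(written > 0 && written < ARG_LEN);
    (void)written;              /* avoid an unused warning under NDEBUG */
    return 0;
}

/* Hypothetical helper: release only what malloc() returned.
 * The array itself was never allocated with malloc(), so it is
 * never passed to free(). */
void release_spawn_args(char *args[2])
{
    free(args[0]);
    args[0] = NULL;
}
```

In the master this would be used as `char *args[2]; fill_spawn_args(args, numDartsWorker);` before the spawn and `release_spawn_args(args);` afterwards.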

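Besides that, MPI_Reduce does not work the way you expect when it is called on an intercommunicator. You have to change the master code so that it passes MPI_ROOT as the root argument to MPI_Reduce, and then you have to add the master's own value manually, since it is not used in the reduction (only the values from the processes in the remote group are reduced).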
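
The intercommunicator reduction described above could look roughly like the sketch below. The function names and the USE_MPI macro are hypothetical; the macro only keeps the arithmetic helper compilable without an MPI installation. The count passed as num_participants is assumed to include the master, matching numWorkers in the question.

```c
#include <assert.h>

/* Combine the master's own estimate with the sum reduced from the
 * remote group, then average over everyone who threw darts. */
double average_estimate(double master_estimate, double remote_sum,
                        int num_participants)
{
    assert(num_participants > 0);
    return (master_estimate + remote_sum) / (double)num_participants;
}

#ifdef USE_MPI  /* hypothetical macro: define it when building with MPI */
#include <mpi.h>

/* Master side: MPI_ROOT marks this process as the receiver. Its send
 * buffer does not contribute to the sum, so the master's estimate is
 * added afterwards by average_estimate(). */
double reduce_on_master(double master_estimate, int num_participants,
                        MPI_Comm everyone)
{
    double remote_sum = 0.0;
    MPI_Reduce(&master_estimate, &remote_sum, 1, MPI_DOUBLE, MPI_SUM,
               MPI_ROOT, everyone);
    return average_estimate(master_estimate, remote_sum, num_participants);
}

/* Worker side: root is the rank of the master within the remote
 * (parent) group, which is 0 when the parent spawned on MPI_COMM_SELF. */
void reduce_on_worker(double worker_estimate, MPI_Comm parent)
{
    double unused = 0.0;  /* receive buffer is not filled on workers */
    MPI_Reduce(&worker_estimate, &unused, 1, MPI_DOUBLE, MPI_SUM,
               0, parent);
}
#endif
```

With one master and one spawned worker, the master would call `reduce_on_master(myresult, numWorkers, everyone)` in place of its current MPI_Reduce call, and the worker's existing call with root 0 stays as it is.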

Regarding c - mpi_comm_spawn error: MPI Application rank 0 killed before MPI_Finalize() with signal 11, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14765148/
