
c++ - Fatal error in MPI_Irecv: Aborting Job


I receive the following sequence of errors when I try to run a problem on four processors. The MPI command I am using is mpirun -np 4

My apologies for posting the error message as is (mainly because I lack the knowledge to decipher the information given). I would appreciate your input on the following:

  1. What does the error message mean? At what point does one receive it? Is it a system memory (hardware) issue, or a communication error (something related to MPI_Isend/Irecv, i.e. a software issue)?

  2. Finally, how do I fix this?

Thanks!

The error message received is as follows. *Please note: this error appears only for longer runs. The code computes fine when the time required to compute the data is small (i.e. 300 time steps compared to 1000 time steps).*

aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x8294a60, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd68c) failed
MPID_Irecv(64): Out of memory

aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x8295080, count=48, MPI_DOUBLE, src=3, tag=-1, MPI_COMM_WORLD, request=0xffffd690) failed
MPID_Irecv(64): Out of memory

aborting job:
Fatal error in MPI_Isend: Internal MPI error!, error stack:
MPI_Isend(142): MPI_Isend(buf=0x8295208, count=48, MPI_DOUBLE, dest=3, tag=0, MPI_COMM_WORLD, request=0xffffd678) failed
(unknown)(): Internal MPI error!

aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x82959b0, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd678) failed
MPID_Irecv(64): Out of memory

rank 3 in job 1 myocyte80_37021 caused collective abort of all ranks; exit status of rank 3: return code 13

rank 1 in job 1 myocyte80_37021 caused collective abort of all ranks; exit status of rank 1: return code 13

Edit (source code):

Header files
Variable declaration
TOTAL TIME =
...
...
double *A = new double[Rows];
double *AA = new double[Rows];
double *B = new double[Rows];
double *BB = new double[Rows];
....
....
int Rmpi;
int my_rank;
int p;
int source;
int dest;
int tag = 0;
function declaration

int main (int argc, char *argv[])
{
MPI_Status status[8];
MPI_Request request[8];
MPI_Init (&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

//PROBLEM SPECIFIC PROPERTIES. VARY BASED ON NODE
if (Flag == 1)
{
if (my_rank == 0)
{
Defining boundary (start/stop) for special elements in tissue (Rows x Column)
}
if (my_rank == 2)
..
if (my_rank == 3)
..
if (my_rank == 4)
..
}

//INITIAL CONDITIONS ALSO VARY BASED ON NODE
for (Columns = 0; Columns < 48; Columns++) // Normal Direction
{
for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
{
if (Flag == 1)
{
if (my_rank == 0)
{
Initial conditions for elements
}
if (my_rank == 1) //MPI
{
}
..
..
..
//SIMULATION START

while(t[0][0] < TOTAL TIME)
{
for (Columns = 0; Columns < 48; Columns++) //Normal Direction
{
for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
{
//SOME MORE PROPERTIES BASED ON NODE
if (my_rank == 0)
{
if (FLAG == 1)
{
Condition 1
}
else
{
Condition 2
}
}

if (my_rank == 1)
....
....
...

//Evaluate functions (differential equations)
Function 1 ();
Function 2 ();
...
...

//Based on output of differential equations, different nodes estimate variable values. Since
//the problem is of nearest neighbor, corners and edges have different neighbors/boundary
//conditions
if (my_rank == 0)
{
If (Row/Column at bottom_left)
{
Variables =
}

if (Row/Column at Bottom Right)
{
Variables =
}
}
...
...

//Keeping track of time for each element in Row and Column. Time is updated for a certain
//element.
t[Column][Row] = t[Column][Row]+dt;

}
}//END OF ROWS AND COLUMNS

// MPI IMPLEMENTATION. AT END OF EVERY TIME STEP, Nodes communicate with nearest neighbor
//First step is to populate arrays with values estimated above
for (Columns = 0; Columns < 48; Columns++)
{
for (Rows = 0; Rows < 48; Rows++)
{
if (my_rank == 0)
{
//Loading the Edges of the (Row x Column) to variables. This One dimensional Array data
//is shared with its nearest neighbor for computation at next time step.

if (Column == 47)
{
A[i] = V[Column][Row];

}
if (Row == 47)
{
B[i] = V[Column][Row];
}
}

...
...

//NON BLOCKING MPI SEND RECV TO SHARE DATA WITH NEAREST NEIGHBOR

if ((my_rank) == 0)
{
MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
MPI_Wait(&request[3], &status[3]);
MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[5]);
MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
MPI_Wait(&request[7], &status[7]);
}

if ((my_rank) == 1)
{
MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[1]);
MPI_Wait(&request[1], &status[1]);
MPI_Isend(Cmpi, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[3]);

MPI_Isend(D, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[6]);
MPI_Irecv(DD, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[8]);
MPI_Wait(&request[8], &status[8]);
}

if ((my_rank) == 2)
{
MPI_Isend(E, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[2]);
MPI_Irecv(EE, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[4]);
MPI_Wait(&request[4], &status[4]);

MPI_Irecv(FF, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[5]);
MPI_Wait(&request[5], &status[5]);
MPI_Isend(Fmpi, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[7]);
}

if ((my_rank) == 3)
{
MPI_Irecv(GG, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[2]);
MPI_Wait(&request[2], &status[2]);
MPI_Isend(G, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[4]);

MPI_Irecv(HH, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
MPI_Wait(&request[6], &status[6]);
MPI_Isend(H, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[8]);
}

//RELOADING Data (from MPI_IRecv array to array used to compute at next time step)
for (Columns = 0; Columns < 48; Columns++)
{
for (Rows = 0; Rows < 48; Rows++)
{
if (my_rank == 0)
{
if (Column == 47)
{
V[Column][Row]= A[i];
}
if (Row == 47)
{
V[Column][Row]=B[i];
}
}

….
//PRINT TO OUTPUT FILE AT CERTAIN POINT
printval = 100;
if ((printdata>=printval))
{
prttofile ();
printdata = 0;
}
printdata = printdata+1;
compute_dt ();

}//CLOSE ALL TIME STEPS

MPI_Finalize ();

}//CLOSE MAIN

Best Answer

Are you calling MPI_Irecv repeatedly? If so, you may not realize that each call allocates a request handle, and these are freed only once the message has been received and its completion has been checked with (for example) MPI_Test. Over-use of MPI_Irecv can exhaust memory, or at least the memory the MPI implementation sets aside for these requests.

Only seeing the code would confirm the problem.
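As a minimal sketch of that advice (not the poster's code), the following assumes a simple ring exchange of 48 doubles per time step; the buffer names and neighbour choice are illustrative. The point is that every request returned by MPI_Isend or MPI_Irecv is completed with MPI_Waitall before the next step, so handles never pile up:

#include <mpi.h>
#include <vector>

int main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 48;                      // edge length, as in the question
    std::vector<double> send_edge(N, 1.0); // data produced during this time step
    std::vector<double> recv_edge(N, 0.0); // neighbour's edge data

    int right = (rank + 1) % size;         // illustrative ring neighbours
    int left  = (rank - 1 + size) % size;

    for (int step = 0; step < 1000; ++step)
    {
        MPI_Request reqs[2];

        // Post the receive and the send; each returns a request handle.
        MPI_Irecv(recv_edge.data(), N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_edge.data(), N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        // Complete (and thereby free) BOTH requests every iteration.
        // Leaving the send requests unwaited is what lets handles accumulate.
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Note also that in the exchange loop shown in the question, the MPI_Irecv requests are waited on but the MPI_Isend requests (for example request[1] and request[5] on rank 0) never are, so those handles are never released; waiting on them as well, or batching everything into one MPI_Waitall per step as above, keeps the request memory bounded.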

Regarding c++ - Fatal error in MPI_Irecv: Aborting Job, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6221058/
