
azure - How to fix MPI_ERR_RMA_SHARED?

Reposted. Author: 行者123. Updated: 2023-12-03 02:51:36

I wrote an MPI program that uses shared memory through MPI_Win_allocate_shared, and I run it on an Azure virtual machine with 4 CPUs. Everything works with 1 or 2 processes, but not with 3 or 4.

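I know that MPI_Win_allocate_shared only works when the processes are on the same node, so I think the problem is related to that. I tried to fix it with a hostfile containing "AzureVM slot=4 max_slots=8", but I still get the error. The error output is shown below: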

mpiexec -np 3 --hostfile my_host --oversubscribe tables

[AzureVM][[37487,1],1][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)

[AzureVM][[37487,1],0][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)

[AzureVM][[37487,1],2][btl_openib_component.c:652:init_one_port] ibv_query_gid failed (mlx4_0:1, 0)

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: AzureVM
Local device: mlx4_0
--------------------------------------------------------------------------

[AzureVM:01918] 2 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[AzureVM:01918] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[AzureVM:1930] *** An error occurred in MPI_Win_allocate_shared
[AzureVM:1930] *** reported by process [2456748033,2]
[AzureVM:1930] *** on communicator MPI_COMM_WORLD
[AzureVM:1930] *** MPI_ERR_RMA_SHARED: Memory cannot be shared
[AzureVM:1930] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[AzureVM:1930] *** and potentially your MPI job)
[AzureVM:01918] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
Makefile:54: recipe for target 'table' failed
make: *** [table] Error 71

Could someone explain how to solve this? Thanks in advance!

Best answer

Hi, has your problem been solved? Consider adding these two lines (following the guide):

MPI_Comm nodecomm;                                                          
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);

After that, allocate the memory on that communicator:

// define alloc_length (sth like: MPI_Aint alloc_length = 10 * sizeof(int);)
int *mem;  // receives the base address of this rank's segment
MPI_Win win;
MPI_Win_allocate_shared(alloc_length, 1, MPI_INFO_NULL, nodecomm, &mem, &win);

I had the same problem (or at least a similar error log) and solved it completely in the way described above.

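For a better understanding, see this. I tested the code at the end of the accepted answer there and, unfortunately, it did not work for me. I modified it as follows: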

#include <stdio.h>
#include <mpi.h>

#define ARRAY_LEN 32

int main() {
    MPI_Init(NULL, NULL);

    int *baseptr;
    // Group the ranks that can share memory (i.e. ranks on the same node).
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int nodesize, noderank;
    MPI_Comm_size(nodecomm, &nodesize);
    MPI_Comm_rank(nodecomm, &noderank);

    // Only node rank 0 contributes memory; the other ranks allocate 0 bytes.
    MPI_Win win;
    MPI_Aint size = (noderank == 0) ? ARRAY_LEN * sizeof(int) : 0;
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL,
                            nodecomm, &baseptr, &win);

    // The other ranks look up the address of rank 0's segment.
    if (noderank != 0) {
        MPI_Aint seg_size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &baseptr);
    }

    // Each rank writes its node rank into a strided slice of the array.
    for (int i = noderank; i < ARRAY_LEN; i += nodesize)
        baseptr[i] = noderank;

    MPI_Barrier(nodecomm);

    if (noderank == 0) {
        for (int i = 0; i < nodesize; i++)
            printf("%4d", baseptr[i]);
        printf("\n");
    }
    MPI_Win_free(&win);

    MPI_Finalize();
}

Now, if you save the code above as test.cpp, then
mpic++ test.cpp && mpirun -n 8 ./a.out will print 0 1 2 3 4 5 6 7


I got some useful hints from here.

Good luck!

Regarding "azure - How to fix MPI_ERR_RMA_SHARED?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55763909/
