C, Open MPI: segmentation fault from call to MPI_Finalize(). The segfault does not always happen, especially with low numbers of processes

Reposted by: 行者123. Updated: 2023-12-01 12:54:48

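I am writing a simple program to learn how to define an MPI_Datatype and use it together with MPI_Gatherv. I wanted to make sure I could combine variable-length, dynamically allocated arrays of structured data on one process, which seems to work fine until I call MPI_Finalize(). Using print statements and the Eclipse PTP debugger (backend is gdb-mi), I have confirmed that this is where the problem starts to show. My main question is: how can I get rid of the segmentation fault?

The segfault does not occur every time I run the code. For instance, it has not happened with 2 or 3 processes, but it tends to happen regularly when I run with about 4 or more processes.

Also, when I run this code under valgrind, the segmentation fault does not occur. However, I do get error messages from valgrind, and I find the output hard to understand when MPI functions are involved, even with a large number of targeted suppressions. I am also concerned that if I add more suppressions, I will silence the useful error message.

I compile the normal code with these flags, so I am using the C99 standard in both cases: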
-ansi -pedantic -Wall -O2 -march=barcelona -fomit-frame-pointer -std=c99
and the debugging code with:
-ansi -pedantic -std=c99 -Wall -g

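Both are built with the gcc 4.4 mpicc compiler and run on a cluster using Red Hat Linux with Open MPI v1.4.5. Please let me know if I have left out any other important information. Here is the code, and thanks in advance: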

//#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//#include <limits.h>

#include "mpi.h"

#define FULL_PROGRAM        1

struct CD{

    int int_ID;
    double dbl_ID;
};

int main(int argc, char *argv[]) {

    int numprocs, myid, ERRORCODE;

#if FULL_PROGRAM
    struct CD *myData=NULL;             //Each process contributes an array of data, comprised of 'struct CD' elements
    struct CD *allData=NULL;            //root will dynamically allocate this array to store all the data from rest of the processes
    int *p_lens=NULL, *p_disp=NULL;     //p_lens stores the number of elements in each process' array, p_disp stores the displacements in bytes
    int MPI_CD_size;                    //stores the size of the MPI_Datatype that is defined to allow communication operations using 'struct CD' elements

    int mylen, total_len=0;             //mylen should be the length of each process' array
                                        //MAXlen is the maximum allowable array length
                                        //total_len will be the sum of mylen across all processes

    // ============ variables related to defining new MPI_Datatype at runtime ====================================================
    struct CD sampleCD = {.int_ID=0, .dbl_ID=0.0};
    int blocklengths[2];                //this describes how many blocks of identical data types will be in the new MPI_Datatype
    MPI_Aint offsets[2];                //this stores the offsets, in bytes(bits?), of the blocks from the 'start' of the datatype
    MPI_Datatype block_types[2];        //this stores which built-in data types the blocks are comprised of
    MPI_Datatype  myMPI_CD;             //just the name of the new datatype
    MPI_Aint myStruct_address, int_ID_address, dbl_ID_address, int_offset, dbl_offset;  //useful place holders for filling the arrays above
    // ===========================================================================================================================
#endif
    // =================== Initializing MPI functionality ============================
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    // ===============================================================================
#if FULL_PROGRAM
    // ================== This part actually formally defines the MPI datatype ===============================================
    MPI_Get_address(&sampleCD, &myStruct_address);          //starting point of struct CD
    MPI_Get_address(&sampleCD.int_ID, &int_ID_address);     //starting point of first entry in CD
    MPI_Get_address(&sampleCD.dbl_ID, &dbl_ID_address);     //starting point of second entry in CD
    int_offset = int_ID_address - myStruct_address;         //offset from start of first to start of CD
    dbl_offset = dbl_ID_address - myStruct_address;         //offset from start of second to start of CD

    blocklengths[0]=1;  blocklengths[1]=1;                  //array telling it how many blocks of identical data types there are, and the number of entries in each block
    //This says there are two blocks of identical data-types, and both blocks have only one variable in them

    offsets[0]=int_offset;  offsets[1]=dbl_offset;          //the first block starts at int_offset, the second block starts at dbl_offset (from 'myData_address'

    block_types[0]=MPI_INT; block_types[1]=MPI_DOUBLE;      //the first block contains MPI_INT, the second contains MPI_DOUBLE

    MPI_Type_create_struct(2, blocklengths, offsets, block_types, &myMPI_CD);       //this uses the above arrays to define the MPI_Datatype...an MPI-2 function

    MPI_Type_commit(&myMPI_CD);     //this is the final step to defining/reserving the data type
    // ========================================================================================================================

    mylen   = myid*2;       //each process is told how long its array should be...I used to define that randomly but that just makes things messier

    p_lens  = (int*)        calloc((size_t)numprocs,    sizeof(int));       //allocate memory for the number of elements (p_lens) and offsets from the start of the recv buffer(d_disp)
    p_disp  = (int*)        calloc((size_t)numprocs,    sizeof(int));

    myData  = (struct CD*)  calloc((size_t)mylen,       sizeof(struct CD));         //allocate memory for each process' array
    //if mylen==0, 'a unique pointer to the heap is returned'

    if(!p_lens) {   MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE);   }
    if(!p_disp) {   MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE);   }
    if(!myData) {   MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE);   }


    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD);                                //purely for keeping the output organized by give a delay in time

    for (int k=0; k<numprocs; ++k) {

        if(myid==k) {

            //printf("\t ID %d has %d entries: { ", myid, mylen);

            for(int i=0; i<mylen; ++i) {

                myData[i]= (struct CD) {.int_ID=myid*(i+1), .dbl_ID=myid*(i+1)};            //fills data elements with simple pattern
                //printf("%d: (%d,%lg) ", i, myData[i].int_ID, myData[i].dbl_ID);
            }
            //printf("}\n");
        }
    }

    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD);                            //purely for keeping the output organized by give a delay in time

    MPI_Gather(&mylen,  1, MPI_INT, p_lens, 1, MPI_INT, 0, MPI_COMM_WORLD);     //Each process sends root the length of the vector they'll be sending

#if 1
    MPI_Type_size(myMPI_CD, &MPI_CD_size);          //gets the size of the MPI_Datatype for p_disp
#else
    MPI_CD_size = sizeof(struct CD);                //using this doesn't change things too much...
#endif

    for(int j=0;j<numprocs;++j) {

        total_len += p_lens[j];

        if (j==0)   {   p_disp[j] = 0;                                      }
        else        {   p_disp[j] = p_disp[j-1] + p_lens[j]*MPI_CD_size;    }
    }

    if (myid==0)    {

        allData = (struct CD*)  calloc((size_t)total_len,   sizeof(struct CD));     //allocate array
        if(!allData)    {   MPI_Abort(MPI_COMM_WORLD, 1); exit(EXIT_FAILURE);   }
    }

    MPI_Gatherv(myData, mylen, myMPI_CD, allData, p_lens, p_disp, myMPI_CD, 0, MPI_COMM_WORLD); //each array sends root process their array, which is stored in 'allData'

    // ============================== OUTPUT CONFIRMING THAT COMMUNICATIONS WERE SUCCESSFUL=========================================
    if(myid==0) {

        for(int i=0;i<numprocs;++i) {
            printf("\n\tElements from %d on MASTER are: { ",i);
            for(int k=0;k<p_lens[i];++k)    {   printf("%d: (%d,%lg) ", k, (allData+p_disp[i]+k)->int_ID, (allData+p_disp[i]+k)->dbl_ID);   }

            if(p_lens[i]==0) printf("NOTHING ");
            printf("}\n");
        }
        printf("\n");       //each data element should appear as two identical numbers, counting upward by the process ID
    }
    // ==========================================================================================================

    if (p_lens) {   free(p_lens);   p_lens=NULL;    }       //adding this in didn't get rid of the MPI_Finalize seg-fault
    if (p_disp) {   free(p_disp);   p_disp=NULL;    }
    if (myData) {   free(myData);   myData=NULL;    }
    if (allData){   free(allData);  allData=NULL;   }       //the if statement ensures that processes not allocating memory for this pointer don't free anything

    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD);                            //purely for keeping the output organized by give a delay in time
    printf("ID %d: I have reached the end...before MPI_Type_free!\n", myid);

    // ====================== CLEAN UP ================================================================================
    ERRORCODE = MPI_Type_free(&myMPI_CD);           //this frees the data type...not always necessary, but a good practice

    for(double temp=0.0;temp<1e6;++temp) temp += exp(-10.0);
    MPI_Barrier(MPI_COMM_WORLD);                                //purely for keeping the output organized by give a delay in time

    if(ERRORCODE!=MPI_SUCCESS)  {   printf("ID %d...MPI_Type_free was not successful\n", myid); MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
    else                        {   printf("ID %d...MPI_Type_free was successful, entering MPI_Finalize...\n", myid);       }
#endif
    ERRORCODE=MPI_Finalize();

    for(double temp=0.0;temp<1e7;++temp) temp += exp(-10.0);        //NO MPI_Barrier AFTER MPI_Finalize!

    if(ERRORCODE!=MPI_SUCCESS)  {   printf("ID %d...MPI_Finalize was not successful\n", myid);  MPI_Abort(MPI_COMM_WORLD, 911); exit(EXIT_FAILURE); }
    else                        {   printf("ID %d...MPI_Finalize was successful\n", myid);      }

    return EXIT_SUCCESS;
}

Best answer

The outer loop over k is spurious, but it is not technically wrong; it is simply useless.
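Each rank only ever enters the branch where k equals its own rank, so the fill step reduces to a single loop over the local array. A minimal sketch of that simplification, assuming a hypothetical helper named fill_pattern (not part of the original code) that receives the rank as a plain int:

```c
#include <assert.h>
#include <stddef.h>

struct CD {
    int int_ID;
    double dbl_ID;
};

/* Hypothetical helper (not in the original program): fills data[0..len-1]
 * with the same pattern as the question's code, but without the outer
 * loop over k. 'rank' plays the role of myid. */
void fill_pattern(struct CD *data, int len, int rank) {
    assert(len >= 0);
    assert(data != NULL || len == 0);
    for (int i = 0; i < len; ++i) {
        data[i] = (struct CD){ .int_ID = rank * (i + 1),
                               .dbl_ID = rank * (i + 1) };
    }
}
```

In the question's program this would be called as fill_pattern(myData, mylen, myid) on every rank.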

The real problem is that your displacements for MPI_GATHERV are wrong. If you run under valgrind, you will see something like this:

==28749== Invalid write of size 2
==28749==    at 0x4A086F4: memcpy (mc_replace_strmem.c:838)
==28749==    by 0x4C69614: unpack_predefined_data (datatype_unpack.h:41)
==28749==    by 0x4C6B336: ompi_generic_simple_unpack (datatype_unpack.c:418)
==28749==    by 0x4C7288F: ompi_convertor_unpack (convertor.c:314)
==28749==    by 0x8B295C7: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:216)
==28749==    by 0x935723C: mca_btl_sm_component_progress (btl_sm_component.c:426)
==28749==    by 0x51D4F79: opal_progress (opal_progress.c:207)
==28749==    by 0x8B225CA: opal_condition_wait (condition.h:99)
==28749==    by 0x8B22718: ompi_request_wait_completion (request.h:375)
==28749==    by 0x8B231E1: mca_pml_ob1_recv (pml_ob1_irecv.c:104)
==28749==    by 0x955E7A7: mca_coll_basic_gatherv_intra (coll_basic_gatherv.c:85)
==28749==    by 0x9F7CBFA: mca_coll_sync_gatherv (coll_sync_gatherv.c:46)
==28749==  Address 0x7b1d630 is not stack'd, malloc'd or (recently) free'd

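This indicates that MPI_GATHERV was somehow given bad information.

(There are other valgrind warnings that come from libltdl in Open MPI, which are unfortunately unavoidable because they stem from a bug in libltdl, and another from PLPA, which is also unavoidable because PLPA does this intentionally, for reasons not worth discussing here.)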

Looking at your displacement computation, I see:

    total_len += p_lens[j];

    if (j == 0) {
        p_disp[j] = 0;
    } else {
        p_disp[j] = p_disp[j - 1] + p_lens[j] * MPI_CD_size;
    }

But MPI gather displacements are expressed in units of the datatype, not in bytes. So it really should be:

p_disp[j] = total_len;
total_len += p_lens[j];

Making this change made the MPI_GATHERV valgrind warning go away for me.
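The effect of the two formulas can be compared without running MPI. The sketch below is an illustration only: the function names byte_style_displs and element_displs are hypothetical, and elem_size stands in for whatever MPI_Type_size reports (12 is an assumed value for an int plus a double with no padding counted). It computes both sets of displacements and, for the byte-style version, how many elements the receive buffer would need in order to contain every block.

```c
#include <assert.h>

/* Formula from the question: mixes a byte size into the displacements.
 * elem_size is an assumed stand-in for the value MPI_Type_size reports.
 * Returns the number of elements the receive buffer would need so that
 * every block [displs[j], displs[j] + counts[j]) fits inside it. */
int byte_style_displs(const int *counts, int n, int elem_size, int *displs) {
    int needed = 0;
    for (int j = 0; j < n; ++j) {
        assert(counts[j] >= 0);
        displs[j] = (j == 0) ? 0 : displs[j - 1] + counts[j] * elem_size;
        if (counts[j] > 0 && displs[j] + counts[j] > needed) {
            needed = displs[j] + counts[j];
        }
    }
    return needed;
}

/* Corrected formula from the answer: an exclusive prefix sum of the counts,
 * in units of elements. Returns the total element count, which is also
 * the required length of the receive buffer. */
int element_displs(const int *counts, int n, int *displs) {
    int total = 0;
    for (int j = 0; j < n; ++j) {
        assert(counts[j] >= 0);
        displs[j] = total;
        total += counts[j];
    }
    return total;
}
```

For the question's counts with 4 processes (0, 2, 4, 6) and the assumed elem_size of 12, the byte-style formula puts the last block at element 144 of a buffer allocated for 12 elements, while the element-based formula gives displacements 0, 0, 2, 6. With only two processes (counts 0 and 2) the byte-style formula already targets element 24 of a 2-element buffer, so the out-of-bounds write is present in small runs too. Such heap writes are undefined behavior and may only surface later, for example inside MPI_Finalize().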

For this question (C, Open MPI: segmentation fault from call to MPI_Finalize()), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10406438/
