Dear PETSc users,

We are trying to understand an issue that has come up when running our code on a
large cloud cluster with a large number of processes and subcommunicators.
This is code that we use daily on multiple clusters without problems, and that
runs clean under Valgrind for small test problems.

The run generates the following error messages but does not crash; it just seems
to hang, with all processes continuing to show activity:

[492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
[492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
[492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
[492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c


Looking at line 117 in PetscGatherMessageLengths(), we find the offending
statement is the MPI_Isend:

  /* Post the Isends with the message length-info */
  for (i=0,j=0; i<size; ++i) {
    if (ilengths[i]) {
      ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
      j++;
    }
  }
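
In case it helps anyone test this independently of our application, below is a
minimal sketch of a standalone MPI program that mimics this exchange pattern:
every rank posts an MPI_Irecv from and an MPI_Isend of a single int to every
other rank. To be clear, this is our own reconstruction for testing, not PETSc
code, and it is fully dense, whereas the real routine only sends to ranks with
nonzero lengths:

/* lengths_test.c: a sketch mimicking the Isend/Irecv pattern of
   PetscGatherMessageLengths(). Every rank receives one int from and
   sends one int to every other rank. This dense version is a worst
   case; the real routine only sends where ilengths[i] != 0.
   Build with: mpicc -o lengths_test lengths_test.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, tag = 0, nreq = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *sendlens = malloc(size * sizeof(int));
  int *recvlens = malloc(size * sizeof(int));
  MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

  for (i = 0; i < size; i++) sendlens[i] = rank + 1;  /* dummy lengths */

  /* Post all receives, then all sends, then wait on everything */
  for (i = 0; i < size; i++) {
    if (i == rank) continue;
    MPI_Irecv(recvlens + i, 1, MPI_INT, i, tag, MPI_COMM_WORLD, reqs + nreq++);
  }
  for (i = 0; i < size; i++) {
    if (i == rank) continue;
    MPI_Isend(sendlens + i, 1, MPI_INT, i, tag, MPI_COMM_WORLD, reqs + nreq++);
  }
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

  if (rank == 0) printf("completed on %d ranks\n", size);
  free(sendlens); free(recvlens); free(reqs);
  MPI_Finalize();
  return 0;
}

We would expect a problem in the MPI stack itself (as opposed to in PETSc) to
show up here as well at comparable rank counts.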

We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all of which
give the same problem.

We suspect some limit on this cloud cluster is being hit, perhaps on the number
of open file descriptors or network connections per node, but we don’t know.
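
One thing we plan to try is printing the resource limits each rank actually
sees at run time, since limits set in a login shell do not always propagate to
remotely launched processes. A minimal sketch, assuming RLIMIT_NOFILE (our
guess at the relevant limit) is the one to check:

/* limits_check.c: print the open-file-descriptor limit each rank
   actually sees. RLIMIT_NOFILE is only our guess at the relevant
   limit; others (e.g. RLIMIT_NPROC) can be queried the same way.
   Build with: mpicc -o limits_check limits_check.c */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
  int rank;
  struct rlimit rl;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
    printf("[%d] RLIMIT_NOFILE soft=%llu hard=%llu\n", rank,
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
  }
  MPI_Finalize();
  return 0;
}

Running this under the same launcher as the real job should at least confirm or
rule out a descriptor limit.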

Anyone have any ideas? We are grasping at straws at this point.

Thanks, Randy M.
