To whom it may concern:
I recently tried to use the 64-bit-indices build of PETSc with the MPI linear solver server to replace our legacy code's solver. However, it gives an error when I use more than 8 cores:

Get NNZ
MatsetPreallocation
MatsetValue
MatSetValue Time per kernel: 43.1147 s
Matassembly
VecsetValue
pestc_solve
Read -1, expected 1951397280, errno = 14

When I run with -start_in_debugger, the error seems to come from MPI_Scatterv:

Rank 0:
#3 0x00001555512e4de5 in mca_pml_ob1_recv () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#4 0x0000155553e01e60 in PMPI_Scatterv () from /lib/x86_64-linux-gnu/libmpi.so.40
#5 0x0000155554b13eab in PCMPISetMat (pc=pc@entry=0x0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:230
#6 0x0000155554b17403 in PCMPIServerBegin () at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#7 0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b "geosimtrs_mpiserver", file=file@entry=0x0, help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n -view_exact_sol : write exact solution vector to stdout\n -m <mesh_x> : number of mesh points in x-direction\n -n <mesh"..., ftn=ftn@entry=PETSC_FALSE, readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#8 0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c, args=args@entry=0x7fffffffda80, file=file@entry=0x0, help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n -view_exact_sol : write exact solution vector to stdout\n -m <mesh_x> : number of mesh points in x-direction\n -n <mesh"...) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#9 0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>) at geosimtrs_mpiserver.c:29

Ranks 1-10:
0x0000155553e1f030 in ompi_coll_base_allgather_intra_bruck () from /lib/x86_64-linux-gnu/libmpi.so.40
#4 0x0000155550f62aaa in ompi_coll_tuned_allgather_intra_dec_fixed () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#5 0x0000155553ddb431 in PMPI_Allgather () from /lib/x86_64-linux-gnu/libmpi.so.40
#6 0x00001555541a2289 in PetscLayoutSetUp (map=0x555555721ed0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/vec/is/utils/pmap.c:248
#7 0x000015555442e06a in MatMPIAIJSetPreallocationCSR_MPIAIJ (B=0x55555572d850, Ii=0x15545a778010, J=0x15545beacb60, v=0x1554cff55e60) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3885
#8 0x00001555544284e3 in MatMPIAIJSetPreallocationCSR (B=0x55555572d850, i=0x15545a778010, j=0x15545beacb60, v=0x1554cff55e60) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3998
#9 0x0000155554b1412f in PCMPISetMat (pc=pc@entry=0x0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:250
#10 0x0000155554b17403 in PCMPIServerBegin () at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#11 0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b "geosimtrs_mpiserver", file=file@entry=0x0, help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n -view_exact_sol : write exact solution vector to stdout\n -m <mesh_x> : number of mesh points in x-direction\n -n <mesh"..., ftn=ftn@entry=PETSC_FALSE, readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#12 0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c, args=args@entry=0x7fffffffda80, file=file@entry=0x0, help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n -view_exact_sol : write exact solution vector to stdout\n -m <mesh_x> : number of mesh points in x-direction\n -n <mesh"...) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#13 0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>) at geosimtrs_mpiserver.c:29

This did not happen with the 32-bit-indices build of PETSc: there, running with more than 8 cores through the MPI linear solver server works smoothly. Nor did it happen with the 64-bit-indices build in plain MPI mode (without -mpi_linear_solver_server). The failure occurs only with 64-bit-indices PETSc combined with the MPI linear solver server, so I think it may be a potential bug.

Any advice would be greatly appreciated. The matrix and the ia/ja arrays are too big to upload; a stripped-down sketch of the client-side setup is included below my signature, and if you need anything else to debug this, please let me know.

- Machine type: HPC
- OS version and type: Linux houamd009 6.1.55-cggdb11-1 #1 SMP Fri Sep 29 10:09:13 UTC 2023 x86_64 GNU/Linux
- PETSc version:
  #define PETSC_VERSION_RELEASE 1
  #define PETSC_VERSION_MAJOR 3
  #define PETSC_VERSION_MINOR 20
  #define PETSC_VERSION_SUBMINOR 4
  #define PETSC_RELEASE_DATE "Sep 28, 2023"
  #define PETSC_VERSION_DATE "Jan 29, 2024"
- MPI implementation: OpenMPI
- Compiler and version: GNU

Yuxiang Lin
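P.S. Since the real matrix is too large to attach, here is a stripped-down sketch of roughly how our client-side code is structured, following the same preallocation / MatSetValues / assembly / solve sequence printed above. The tiny tridiagonal system, the problem size, and the variable names are placeholders rather than the legacy data, and I have left out our other runtime options.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, b;
  KSP         ksp;
  PetscInt    n = 10, i, cols[3];
  PetscScalar vals[3];

  /* With -mpi_linear_solver_server only rank 0 returns from PetscInitialize();
     the other ranks enter PCMPIServerBegin() and wait for the matrix and
     vectors, which is where the stack traces above show them blocked. */
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Preallocation: the client-side matrix is sequential; with
     --with-64-bit-indices every PetscInt here is 64-bit. */
  PetscCall(MatCreate(PETSC_COMM_SELF, &A));
  PetscCall(MatSetSizes(A, n, n, n, n));
  PetscCall(MatSetType(A, MATSEQAIJ));
  PetscCall(MatSeqAIJSetPreallocation(A, 3, NULL));

  /* MatSetValues: tiny tridiagonal stand-in for the legacy ia/ja/values */
  for (i = 0; i < n; i++) {
    PetscInt ncols = 0;
    if (i > 0)     { cols[ncols] = i - 1; vals[ncols++] = -1.0; }
    cols[ncols] = i; vals[ncols++] = 2.0;
    if (i < n - 1) { cols[ncols] = i + 1; vals[ncols++] = -1.0; }
    PetscCall(MatSetValues(A, 1, &i, ncols, cols, vals, INSERT_VALUES));
  }

  /* Assembly */
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  /* Right-hand side and solution vectors */
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, n, &b));
  PetscCall(VecDuplicate(b, &x));
  PetscCall(VecSet(b, 1.0));

  /* Solve: with -mpi_linear_solver_server the matrix is handed to the server
     ranks through PCMPISetMat / MatMPIAIJSetPreallocationCSR, as in the trace. */
  PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

We launch it as mpiexec -n <nproc> ./geosimtrs_mpiserver -mpi_linear_solver_server (plus our usual solver options); the failure only appears once <nproc> goes above 8.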