That is a great idea. I'll figure it out. --Junchao Zhang
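For concreteness, a minimal sketch of the kind of suspect-version check Barry suggests below. Everything here is hypothetical: SUSPECT_VERSIONS, checkSuspectVersion, and the placeholder version entries are illustrative names, not existing PETSc configure or package.py code.

    # Minimal sketch only; names and entries are hypothetical, not PETSc configure API.
    SUSPECT_VERSIONS = {
        'mpich':   ['x.y.z'],   # placeholder; fill in versions known to misbehave
        'openmpi': [],
        'hdf5':    [],
    }

    def checkSuspectVersion(package, version):
        """Print a boxed warning at configure time if this package version is on the suspect list."""
        if version not in SUSPECT_VERSIONS.get(package, []):
            return
        msg = ('WARNING: %s %s is a suspect release; consider a different version '
               'or the --download-%s option' % (package, version, package))
        border = '*' * (len(msg) + 4)
        print(border)
        print('* ' + msg + ' *')
        print(border)

    if __name__ == '__main__':
        checkSuspectVersion('mpich', 'x.y.z')

A generalized version could live in a common base class such as package.py, so each package only declares its own suspect list.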
On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>    Junchao,
>
>    Any way in the PETSc configure to warn that an MPICH version is "bad" or "untrustworthy", or even the vague "we have suspicions about this version and recommend avoiding it"? A lot of time could be saved if others don't have to deal with the same problem.
>
>    Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc., always check against that list, and print a boxed warning at configure time? Better, you could somehow generalize it and put it in package.py for use by all packages; then any package can include lists of "suspect" versions. (There are definitely HDF5 versions that should be avoided :-)).
>
>    Barry
>
> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> Thanks for the update. Let's assume it is a bug in MPI :)
> --Junchao Zhang
>
> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com> wrote:
>
>> Just as an update to this, I can confirm that using the mpich version (3.3.2) downloaded with the petsc download solved this issue on my end.
>>
>> *Chris Hewson*
>> Senior Reservoir Simulation Engineer
>> ResFrac
>> +1.587.575.9792
>>
>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>>    Is there a comprehensive MPI test suite (perhaps from MPICH)? Is there any way to run this full test suite under the problematic MPI and see if it detects any problems?
>>>>
>>>>    If so, could someone add it to the FAQ in the debugging section?
>>>
>>> MPICH does have a test suite. It is in the test/mpi subdirectory of the downloaded mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It annoyed me since it is not user-friendly. It might be helpful in catching bugs at very small scale, but if, say, I want to test allreduce on 1024 ranks on 100 doubles, I have to hack the test suite. Anyway, the instructions are below.
>>>
>>> For the purposes of petsc, under test/mpi one can configure it with
>>>   $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast --disable-spawn --disable-cxx --disable-ft-tests   // It is weird that I disabled cxx but still had to set CXX!
>>>   $ make -k -j8   // -k keeps going and ignores compilation errors, e.g., when building tests for MPICH extensions that are not in the MPI standard but your MPI is OpenMPI.
>>>   $ // edit testlist: remove the lines mpi_t, rma, f77, impls. Those are sub-dirs containing tests for MPI routines petsc does not rely on.
>>>   $ make testing   // or directly './runtests -tests=testlist'
>>>
>>> On a batch system,
>>>   $ export MPITEST_BATCHDIR=`pwd`/btest   // specify a batch dir, say btest
>>>   $ ./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist   // uses 1024 ranks if a test does not specify the number of processes
>>>   $ // It copies the test binaries to the batch dir and generates a script runtests.batch there. Edit the script to fit your batch system, then submit a job and wait for it to finish.
>>>   $ cd btest && ../checktests --ignorebogus
>>>
>>> PS: Fande, that changing the MPI fixed your problem does not necessarily mean the old MPI has bugs. It is complicated. It could be a petsc bug. You need to provide us code that reproduces your error. It does not matter if the code is big.
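As a quick way to exercise one specific collective at scale without hacking the test suite (e.g., the allreduce on 1024 ranks with 100 doubles mentioned above), a small standalone check can help. Below is a minimal sketch that assumes mpi4py and NumPy are available; the MPICH suite itself is C, so this is only an illustration, not part of the suite.

    # Sketch of a standalone MPI_Allreduce sanity check; assumes mpi4py and NumPy.
    # Run with e.g.:  mpirun -np 1024 python allreduce_check.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 100                                 # number of doubles per rank
    send = np.full(n, rank + 1.0)           # rank r contributes r+1 in every slot
    recv = np.empty(n)
    comm.Allreduce(send, recv, op=MPI.SUM)

    expected = size * (size + 1) / 2.0      # sum of 1..size
    if not np.allclose(recv, expected):
        print('rank %d: allreduce mismatch, got %g expected %g' % (rank, recv[0], expected))
    elif rank == 0:
        print('allreduce on %d ranks, %d doubles: OK' % (size, n))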
>>>>
>>>>    Thanks
>>>>
>>>>    Barry
>>>>
>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com> wrote:
>>>>
>>>> Trace could look like this:
>>>>
>>>> [640]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>> [640]PETSC ERROR: Argument out of range
>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed 740521
>>>> [640]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020
>>>> [640]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 --download-ptscotch=1 --download-parmetis=1 --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices --download-mumps=0
>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line 901 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line 3180 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>
>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com> wrote:
>>>>
>>>>> I am not entirely sure what is happening, but we encountered similar issues recently. It was not reproducible. It might occur at different stages, and the errors could be weird things other than the "ctable stuff." Our code was Valgrind clean, since every PR in moose needs to go through rigorous Valgrind checks before it reaches the devel branch. The errors happened when we used mvapich.
>>>>>
>>>>> We changed to HPE-MPT (a vendor-installed MPI), and then everything was smooth. Could you try a different MPI? It is better to try a system-provided one.
>>>>>
>>>>> We have not gotten to the bottom of this problem yet, but we at least know it is somehow MPI-related.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Fande,
>>>>>
>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am having a bug that is occurring in PETSc with the return string:
>>>>>>
>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than largest key allowed 5693
>>>>>>
>>>>>> This is using petsc-3.13.2, compiled and run using mpich with -O3, debugging turned off, and tuned to the haswell architecture. It occurs either before or during a KSPBCGS solve/setup or during a MUMPS factorization solve (I haven't been able to replicate this issue with the same set of instructions, etc.).
>>>>>>
>>>>>> This is a terrible way to ask a question, I know, and not very helpful from your side, but this is what I have from a user's run and can't reproduce on my end (either with the optimization compilation or with debugging turned on). It happens when the code has run for quite some time, and it happens somewhat rarely.
>>>>>>
>>>>>> More than likely I am using a static variable (the code is written in C++) that I'm not updating when the matrix size changes, or something silly like that, but any help or guidance on this would be appreciated.
>>>>>>
>>>>>> *Chris Hewson*
>>>>>> Senior Reservoir Simulation Engineer
>>>>>> ResFrac
>>>>>> +1.587.575.9792