On Thu, Sep 24, 2020 at 3:42 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>   The stack is listed below. It crashes inside MatPtAP().

What about just checking that the column indices that PtAP receives are valid? Are we not doing that?

   Matt

>   It is possible there is some subtle bug in the rather complex PETSc code for MatPtAP(), but I am inclined to blame MPI again.
>
>   I think we should add some simple, low-overhead, always-on communication error detecting code to PetscSF where some checksums are also communicated at the highest level of PetscSF().
>
>   I don't know how, but perhaps when the data is packed per destination rank a checksum is computed, and when unpacked the checksum is compared, using extra space at the end of the communicated packed array to store and send the checksum. Yes, it is kind of odd for a high-level library like PETSc not to trust the communication channel, but MPI implementations have proven themselves not to be trustworthy, and adding this to PetscSF is not intrusive to the PETSc API or user. Plus it gives a definitive yes or no as to whether the problem comes from an error in the communication.
>
>   Barry
>
> On Sep 24, 2020, at 12:35 PM, Matthew Knepley <knep...@gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 1:22 PM Chris Hewson <ch...@resfrac.com> wrote:
>
>> Hi Guys,
>>
>> Thanks for all of the prompt responses, very helpful and appreciated.
>>
>> By "when debugging", did you mean when you configure petsc with --with-debugging=1 COPTFLAGS=-O0 -g etc., or when you attached a debugger?
>> - Both. I have run with a debugger attached and detached, all compiled with the following flags: "--with-debugging=1 COPTFLAGS=-O0 -g"
>>
>> 1) Try OpenMPI (probably won't help, but worth trying)
>> - Worth a try for sure
>>
>> 2) Find which part of the simulation makes it non-deterministic. Is it the mesh partitioning (parmetis)? Then try to make it deterministic.
>> - Good tip. It is the mesh partitioning and, along the lines of a question from Barry, the matrix size is changing. I will make this deterministic and give it a try.
>>
>> 3) Dump matrices, vectors, etc. and see when it fails; you can quickly reproduce the error by reading in the intermediate data.
>> - Also a great suggestion, will give it a try.
>>
>> The full stack would be really useful here. I am guessing this happens in MatMult(), but I do not know.
>> - Agreed. I am currently running it so that the full stack will be produced, but I am waiting for it to fail; it was compiled with all available optimizations on, which is of course the downside if there is a failure.
>> As a general question, roughly what is the performance impact on petsc of -O1/-O2/-O0 as opposed to -O3? And of --with-debugging=1?
>> Obviously this is problem/machine dependent; I am wondering about rough guidance more than anything.
>>
>> Is the nonzero structure of your matrices changing or is it fixed for the entire simulation?
>> The non-zero structure is changing, although the structures are re-formed when this happens, and this happens thousands of times before the failure occurs.
>>
>
> Okay, this is the most likely spot for a bug. How are you changing the matrix? It should be impossible to put in an invalid column index when using MatSetValues() because we check all the inputs. However, I do not think we check when you just yank out the arrays.
>
>   Thanks,
>
>     Matt
>
>
>> Does this particular run always crash at the same place? Similar place? Doesn't always crash?
>> Doesn't always crash, but other similar runs have crashed in different spots, which makes it difficult to resolve. I am going to try out a few of the strategies suggested above and will let you know what comes of that.
>>
>> *Chris Hewson*
>> Senior Reservoir Simulation Engineer
>> ResFrac
>> +1.587.575.9792
>>
>>
>> On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> Chris,
>>>
>>>    We realize how frustrating this type of problem is to deal with.
>>>
>>>    Here is the code:
>>>
>>>     ierr = PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>>>     for (i=0; i<aij->B->rmap->n; i++) {
>>>       for (j=0; j<B->ilen[i]; j++) {
>>>         PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>>>         ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>>>         if (!data) {
>>>           /* one based table */
>>>           ierr = PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>>>         }
>>>       }
>>>     }
>>>
>>>    It is simply looping over the rows of the sparse matrix, putting the columns it finds into the hash table.
>>>
>>>    aj[B->i[i] + j] are the column entries, and the number of columns in the matrix is mat->cmap->N, so the column entries should always be less than the number of columns. The code is crashing on column entry 24443, which is larger than the number of columns, 23988.
>>>
>>>    So either the aj[B->i[i] + j] + 1 are incorrect or the mat->cmap->N is incorrect.
>>>
>>>    [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>
>>>   if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>>>     ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>>>   }
>>>
>>>    This seems to indicate it is setting up a new multiply because it is either the first time into the algorithm or the nonzero structure changed on some rank, requiring a new assembly process.
>>>
>>>    Is the nonzero structure of your matrices changing, or is it fixed for the entire simulation?
>>>
>>>    Since the code has been running for a very long time already, I have to conclude that this is not the first time through, and so something has changed in the matrix?
>>>
>>>    I think we have to put more diagnostics into the library to provide more information before or at the time of the error detection.
>>>
>>>    Does this particular run always crash at the same place? Similar place? Doesn't always crash?
>>>
>>>   Barry
>>>
>>>
>>>
>>>
>>> On Sep 24, 2020, at 8:46 AM, Chris Hewson <ch...@resfrac.com> wrote:
>>>
>>> After about a month of not having this issue pop up, it has come up again.
>>>
>>> We have been struggling with a similar PETSc error for a while now; the error message is as follows:
>>>
>>> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than largest key allowed 23988
>>>
>>> It is a particularly nasty bug as it doesn't reproduce itself when debugging and doesn't happen all the time with the same inputs either. The problem occurs after a long runtime of the code (12-40 hours), and we are using a KSP solve with KSPBCGS.
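A minimal sketch of the kind of bounds check Matt asks about at the top of the thread, written against the MatSetUpMultiply_MPIAIJ excerpt Barry quotes above. The helper name and its placement are hypothetical, it relies on PETSc's private MPIAIJ header, and it only illustrates "validate the off-diagonal column indices before they reach the PetscTable"; it is not the actual PETSc source.

    #include <../src/mat/impls/aij/mpi/mpiaij.h>  /* private header; illustrative only */

    /* Hypothetical check, not part of PETSc: walk the off-diagonal block exactly
       as the excerpt above does and fail loudly if any global column index is
       out of range, instead of letting PetscTableFind() trip over it later. */
    static PetscErrorCode CheckOffDiagColumns(Mat mat)
    {
      Mat_MPIAIJ     *aij = (Mat_MPIAIJ*)mat->data;
      Mat_SeqAIJ     *B   = (Mat_SeqAIJ*)aij->B->data;
      const PetscInt *aj  = B->j;
      PetscInt        i, j;

      PetscFunctionBegin;
      for (i = 0; i < aij->B->rmap->n; i++) {
        for (j = 0; j < B->ilen[i]; j++) {
          PetscInt col = aj[B->i[i] + j];
          if (col < 0 || col >= mat->cmap->N) {
            SETERRQ2(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
                     "Invalid global column %D; matrix has only %D columns", col, mat->cmap->N);
          }
        }
      }
      PetscFunctionReturn(0);
    }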
>>>
>>> The PETSc compilation options that are used are:
>>>
>>> --download-metis
>>> --download-mpich
>>> --download-mumps
>>> --download-parmetis
>>> --download-ptscotch
>>> --download-scalapack
>>> --download-suitesparse
>>> --prefix=/opt/anl/petsc-3.13.3
>>> --with-debugging=0
>>> --with-mpi=1
>>> COPTFLAGS=-O3 -march=haswell -mtune=haswell
>>> CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>> FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>
>>> This is being run across 8 processes and is failing consistently on the rank 7 process. We also use OpenMP outside of PETSc and the linear-solve portion of the code. The rank 0 process is always computing; during this, the slave processes use an MPI_Wait call to wait for the collective parts of the code to be called again. All PETSc calls are done across all of the processes.
>>>
>>> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3 package.
>>>
>>> At every PETSc call we are checking the error return from the function collectively to ensure that no errors have been returned from PETSc.
>>>
>>> Some possible causes that I can think of are as follows:
>>> 1. A memory leak causing a corruption either in our program, in petsc, or in one of the petsc objects. This seems unlikely as we have checked runs with the option -malloc_dump for PETSc and using valgrind.
>>>
>>> 2. Optimization flags set for the petsc compilation are causing variables that go out of scope to be optimized out.
>>>
>>> 3. We are giving the wrong number of elements for a process, or the value is changing during the simulation. This seems unlikely as there is nothing overly unique about these simulations and it's not reproducing itself.
>>>
>>> 4. An MPI channel or socket error causing an error in the collective values for PETSc.
>>>
>>> Any input on this issue would be greatly appreciated.
>>>
>>> *Chris Hewson*
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>>
>>>
>>> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>>> That is a great idea. I'll figure it out.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>>
>>>>>   Junchao,
>>>>>
>>>>>   Is there any way in the PETSc configure to warn that an MPICH version is "bad" or "untrustworthy", or even the vague "we have suspicions about this version and recommend avoiding it"? A lot of time could be saved if others didn't have to deal with the same problem.
>>>>>
>>>>>   Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc., always check against that list, and print a boxed warning at configure time? Better, you could somehow generalize it and put it in package.py for use by all packages; then any package can include lists of "suspect" versions. (There are definitely HDF5 versions that should be avoided :-)).
>>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>
>>>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>
>>>>>> Just as an update to this, I can confirm that using the mpich version (3.3.2) downloaded with the petsc download solved this issue on my end.
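For the replay strategy Matt suggests earlier in the thread (dump the operands right before the failing call so the problem can be reproduced from files), a minimal PETSc sketch might look like the following. The helper names are hypothetical wrappers; the viewer calls themselves are standard PETSc API.

    #include <petscmat.h>

    /* Hypothetical helper: write a matrix to a binary file so a failing
       MatAssemblyEnd()/MatPtAP() call can later be replayed offline. */
    PetscErrorCode DumpMatForReplay(Mat A, const char *path)
    {
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscViewerBinaryOpen(PetscObjectComm((PetscObject)A), path, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
      ierr = MatView(A, viewer);CHKERRQ(ierr);
      ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

    /* Hypothetical helper: read the dumped matrix back in a small standalone driver. */
    PetscErrorCode LoadMatForReplay(MPI_Comm comm, const char *path, Mat *A)
    {
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = MatCreate(comm, A);CHKERRQ(ierr);
      ierr = PetscViewerBinaryOpen(comm, path, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
      ierr = MatLoad(*A, viewer);CHKERRQ(ierr);
      ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }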
>>>>>>
>>>>>> *Chris Hewson*
>>>>>> Senior Reservoir Simulation Engineer
>>>>>> ResFrac
>>>>>> +1.587.575.9792
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>
>>>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>   Is there a comprehensive MPI test suite (perhaps from MPICH)? Is there any way to run this full test suite under the problematic MPI and see if it detects any problems?
>>>>>>>>
>>>>>>>>   If so, could someone add it to the FAQ in the debugging section?
>>>>>>>>
>>>>>>> MPICH does have a test suite. It is in the subdir test/mpi of the downloaded mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It annoyed me since it is not user-friendly. It might be helpful in catching bugs at very small scale. But say I want to test allreduce on 1024 ranks on 100 doubles: I have to hack the test suite. Anyway, the instructions are here.
>>>>>>>
>>>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>>>> $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast --disable-spawn --disable-cxx --disable-ft-tests  // It is weird that I disabled cxx but had to set CXX!
>>>>>>> $ make -k -j8  // -k is to keep going and ignore compilation errors, e.g., when building tests for MPICH extensions not in the MPI standard but your MPI is OpenMPI.
>>>>>>> $ // edit testlist, remove lines mpi_t, rma, f77, impls. Those are sub-dirs containing tests for MPI routines petsc does not rely on.
>>>>>>> $ make testings, or directly './runtests -tests=testlist'
>>>>>>>
>>>>>>> On a batch system,
>>>>>>> $ export MPITEST_BATCHDIR=`pwd`/btest  // specify a batch dir, say btest,
>>>>>>> $ ./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist  // Use 1024 ranks if a test does not specify the number of processes.
>>>>>>> $ // It copies test binaries to the batch dir and generates a script runtests.batch there. Edit the script to fit your batch system, then submit a job and wait for it to finish.
>>>>>>> $ cd btest && ../checktests --ignorebogus
>>>>>>>
>>>>>>>
>>>>>>> PS: Fande, changing the MPI fixed your problem, but that does not necessarily mean the old MPI has bugs. It is complicated. It could be a petsc bug. You need to provide us a code to reproduce your error. It does not matter if the code is big.
>>>>>>>
>>>>>>>
>>>>>>>>   Thanks
>>>>>>>>
>>>>>>>>   Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> The trace could look like this:
>>>>>>>>
>>>>>>>> [640]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed 740521
>>>>>>>> [640]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020
>>>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 --download-ptscotch=1 --download-parmetis=1 --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices --download-mumps=0
>>>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line 901 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line 3180 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>>>>>
>>>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am not entirely sure what is happening, but we encountered similar issues recently. It was not reproducible.
>>>>>>>>> It might occur at different stages, and the errors could be weird ones other than the "ctable stuff." Our code was Valgrind clean, since every PR in moose needs to go through rigorous Valgrind checks before it reaches the devel branch. The errors happened when we used mvapich.
>>>>>>>>>
>>>>>>>>> We changed to HPE-MPT (a vendor-installed MPI), and then everything was smooth. Could you try a different MPI? It is better to try one provided by the system.
>>>>>>>>>
>>>>>>>>> We have not gotten to the bottom of this problem yet, but we at least know it is somehow MPI-related.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Fande,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having a bug that is occurring in PETSc with the return string:
>>>>>>>>>>
>>>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than largest key allowed 5693
>>>>>>>>>>
>>>>>>>>>> This is using petsc-3.13.2, compiled and run using mpich with -O3, debugging turned off, and tuned to the haswell architecture, and it occurs either before or during a KSPBCGS solve/setup or during a MUMPS factorization solve (I haven't been able to replicate this issue with the same set of instructions etc.).
>>>>>>>>>>
>>>>>>>>>> This is a terrible way to ask a question, I know, and not very helpful from your side, but this is what I have from a user's run and I can't reproduce it on my end (either with the optimized compilation or with debugging turned on). This happens when the code has run for quite some time and is happening somewhat rarely.
>>>>>>>>>>
>>>>>>>>>> More than likely I am using a static variable (the code is written in C++) that I'm not updating when the matrix size is changing, or something silly like that, but any help or guidance on this would be appreciated.
>>>>>>>>>>
>>>>>>>>>> *Chris Hewson*
>>>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>>>> ResFrac
>>>>>>>>>> +1.587.575.9792
>>>>>>>>>>

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
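Barry's checksum idea near the top of the thread (compute a checksum as each per-destination buffer is packed, send it in extra space at the end of the packed array, and verify it when unpacking) could look roughly like the sketch below. The helper names are hypothetical and nothing here is part of the actual PetscSF implementation; it only illustrates the scheme.

    #include <stdint.h>
    #include <string.h>

    /* FNV-1a 64-bit hash over a byte buffer; any cheap checksum would do. */
    static uint64_t fnv1a64(const void *data, size_t nbytes)
    {
      const unsigned char *p = (const unsigned char *)data;
      uint64_t             h = 1469598103934665603ULL;
      size_t               i;
      for (i = 0; i < nbytes; i++) { h ^= p[i]; h *= 1099511628211ULL; }
      return h;
    }

    /* Sender side: buf must have room for payload_bytes + sizeof(uint64_t);
       the checksum travels in the extra space at the end of the packed buffer. */
    static void pack_append_checksum(char *buf, size_t payload_bytes)
    {
      uint64_t sum = fnv1a64(buf, payload_bytes);
      memcpy(buf + payload_bytes, &sum, sizeof(sum));
    }

    /* Receiver side: returns 1 if the payload arrived intact, 0 if it was
       corrupted somewhere in the communication. */
    static int unpack_verify_checksum(const char *buf, size_t payload_bytes)
    {
      uint64_t sent, recomputed = fnv1a64(buf, payload_bytes);
      memcpy(&sent, buf + payload_bytes, sizeof(sent));
      return sent == recomputed;
    }

A mismatch on the receiving rank would then give the definitive "the bytes were damaged in transit" answer Barry is after, independent of anything the higher-level matrix code later does with the data.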