Re: [petsc-users] Tough to reproduce petsctablefind error

Barry Smith Thu, 24 Sep 2020 13:40:09 -0700

  Oh, sorry, my mistake.


> On Sep 24, 2020, at 3:31 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
> 
> That error stack was from Fande Kong. We should wait for Chris's.
> 
> --Junchao Zhang
> 
> 
> On Thu, Sep 24, 2020 at 2:42 PM Barry Smith <bsm...@petsc.dev 
> <mailto:bsm...@petsc.dev>> wrote:
> 
>   The stack is listed below. It crashes inside MatPtAP(). 
> 
>   It is possible there is some subtle bug in the rather complex PETSc code 
> for MatPtAP() but I am included to blame MPI again. 
> 
>   I think we should add some simple low-overhead always on communication 
> error detecting code to PetscSF where some check sums are also communicated 
> at the highest level of PetscSF(). 
> 
>    I don't know how but perhaps when the data is packed per destination rank 
> a checksum is computed and when unpacked the checksum is compared using extra 
> space at the end of the communicated packed array to store and send the 
> checksum. Yes, it is kind of odd for a high level library like PETSc to not 
> trust the communication channel but MPI implementations have proven 
> themselves to not be trustworthy and adding this to PetscSF is not intrusive 
> to the PETSc API or user. Plus it gives a definitive yes or no as to the 
> problem being from an error in the communication.
> 
>   Barry
> 
>> On Sep 24, 2020, at 12:35 PM, Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>> 
>> On Thu, Sep 24, 2020 at 1:22 PM Chris Hewson <ch...@resfrac.com 
>> <mailto:ch...@resfrac.com>> wrote:
>> Hi Guys,
>> 
>> Thanks for all of the prompt responses, very helpful and appreciated.
>> 
>> By "when debugging", did you mean when configure petsc --with-debugging=1 
>> COPTFLAGS=-O0 -g etc or when you attached a debugger?
>> - Both, I have run with a debugger attached and detached, all compiled with 
>> the following flags "--with-debugging=1 COPTFLAGS=-O0 -g"
>> 
>> 1) Try OpenMPI (probably won't help, but worth trying)
>> - Worth a try for sure
>> 
>> 2) Find which part of the simulation makes it non-deterministic. Is it the 
>> mesh partitioning (parmetis)? Then try to make it deterministic. 
>> - Good tip, it is the mesh partitioning and along the lines of a question 
>> from Barry, the matrix size is changing. I will make this deterministic and 
>> give it a try
>> 
>> 3) Dump matrices, vectors, etc and see when it fails, you can quickly 
>> reproduce the error by reading in the intermediate data.
>> - Also a great suggestion, will give it a try
>> 
>> The full stack would be really useful here. I am guessing this happens on 
>> MatMult(), but I do not know.
>> - Agreed, I am currently running it so that the full stack will be produced, 
>> but waiting for it to fail, had compiled with all available optimizations 
>> on, but downside is of course if there is a failure.
>> As a general question, roughly what's the performance impact on petsc with 
>> -o1/-o2/-o0 as opposed to -o3? Performance impact of --with-debugging = 1?
>> Obviously problem/machine dependant, wondering on guidance more for this 
>> than anything
>> 
>> Is the nonzero structure of your matrices changing or is it fixed for the 
>> entire simulation?
>> The non-zero structure is changing, although the structures are reformed 
>> when this happens and this happens thousands of time before the failure has 
>> occured.
>> 
>> Okay, this is the most likely spot for a bug. How are you changing the 
>> matrix? It should be impossible to put in an invalid column index when using 
>> MatSetValues()
>> because we check all the inputs. However, I do not think we check when you 
>> just yank out the arrays.
>> 
>>   Thanks,
>> 
>>      Matt
>>  
>> Does this particular run always crash at the same place? Similar place? 
>> Doesn't always crash?
>> Doesn't always crash, but other similar runs have crashed in different 
>> spots, which makes it difficult to resolve. I am going to try out a few of 
>> the strategies suggested above and will let you know what comes of that. 
>> 
>> Chris Hewson
>> Senior Reservoir Simulation Engineer
>> ResFrac
>> +1.587.575.9792
>> 
>> 
>> On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsm...@petsc.dev 
>> <mailto:bsm...@petsc.dev>> wrote:
>>  Chris,
>> 
>>    We realize how frustrating this type of problem is to deal with. 
>> 
>>    Here is the code:
>> 
>>       ierr = 
>> PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>>     for (i=0; i<aij->B->rmap->n; i++) {
>>       for (j=0; j<B->ilen[i]; j++) {
>>         PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>>         ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>>         if (!data) {
>>           /* one based table */
>>           ierr = 
>> PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>>         }
>>       }
>>     }
>> 
>>    It is simply looping over the rows of the sparse matrix putting the 
>> columns it finds into the hash table. 
>> 
>>    aj[B->i[i] + j]  are the column entries, the number of columns in the 
>> matrix is mat->cmap->N so the column entries should always be 
>> less than the number of columns. The code is crashing when column entry 
>> 24443 which is larger than the number of columns 23988.
>> 
>> So either the aj[B->i[i] + j] + 1 are incorrect or the mat->cmap->N is 
>> incorrect.
>> 
>>>>> 640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>> 
>>  if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>>     ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>>   }
>> 
>> Seems to indicate it is setting up a new multiple because it is either the 
>> first time into the algorithm or the nonzero structure changed on some rank 
>> requiring a new assembly process. 
>> 
>>    Is the nonzero structure of your matrices changing or is it fixed for the 
>> entire simulation? 
>> 
>> Since the code has been running for a very long time already I have to 
>> conclude that this is not the first time through and so something has 
>> changed in the matrix?
>> 
>> I think we have to put more diagnostics into the library to provide more 
>> information before or at the time of the error detection. 
>> 
>>    Does this particular run always crash at the same place? Similar place? 
>> Doesn't always crash?
>> 
>>   Barry
>> 
>> 
>> 
>> 
>>> On Sep 24, 2020, at 8:46 AM, Chris Hewson <ch...@resfrac.com 
>>> <mailto:ch...@resfrac.com>> wrote:
>>> 
>>> After about a month of not having this issue pop up, it has come up again
>>> 
>>> We have been struggling with a similar PETSc Error for awhile now, the 
>>> error message is as follows:
>>> 
>>> [7]PETSC ERROR: PetscTableFind() line 132 in 
>>> /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than 
>>> largest key allowed 23988
>>> 
>>> It is a particularly nasty bug as it doesn't reproduce itself when 
>>> debugging and doesn't happen all the time with the same inputs either. The 
>>> problem occurs after a long runtime of the code (12-40 hours) and we are 
>>> using a ksp solve with KSPBCGS. 
>>> 
>>> The PETSc compilation options that are used are:
>>> 
>>> --download-metis
>>>     --download-mpich
>>>     --download-mumps
>>>     --download-parmetis
>>>     --download-ptscotch
>>>     --download-scalapack
>>>     --download-suitesparse
>>>     --prefix=/opt/anl/petsc-3.13.3
>>>     --with-debugging=0
>>>     --with-mpi=1
>>>     COPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>     CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>     FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>> 
>>> This is being run across 8 processes and is failing consistently on the 
>>> rank 7 process. We also use openmp outside of PETSC and the linear solve 
>>> portion of the code. The rank 0 process is always using compute, during 
>>> this the slave processes use an MPI_Wait call to wait for the collective 
>>> parts of the code to be called again. All PETSC calls are done across all 
>>> of the processes.
>>> 
>>> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3 package.
>>> 
>>> At every PETSC call we are checking the error return from the function 
>>> collectively to ensure that no errors have been returned from PETSC.
>>> 
>>> Some possible causes that I can think of are as follows:
>>> 1. Memory leak causing a corruption either in our program or in petsc or 
>>> with one of the petsc objects. This seems unlikely as we have checked runs 
>>> with the option -malloc_dump for PETSc and using valgrind.
>>> 
>>> 2. Optimization flags set for petsc compilation are causing variables that 
>>> go out of scope to be optimized out. 
>>> 
>>> 3. We are giving the wrong number of elements for a process or the value is 
>>> changing during the simulation. This seems unlikely as there is nothing 
>>> overly unique about these simulations and it's not reproducing itself.
>>> 
>>> 4. An MPI channel or socket error causing an error in the collective values 
>>> for PETSc.
>>> 
>>> Any input on this issue would be greatly appreciated.
>>> 
>>> Chris Hewson
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>> 
>>> 
>>> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zh...@gmail.com 
>>> <mailto:junchao.zh...@gmail.com>> wrote:
>>> That is a great idea. I'll figure it out.
>>> --Junchao Zhang
>>> 
>>> 
>>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev 
>>> <mailto:bsm...@petsc.dev>> wrote:
>>> 
>>>   Junchao,
>>> 
>>>     Any way in the PETSc configure to warn that MPICH version is "bad" or 
>>> "untrustworthy" or even the vague "we have suspicians about this version 
>>> and recommend avoiding it"? A lot of time could be saved if others don't 
>>> deal with the same problem.
>>> 
>>>     Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc and always 
>>> check against that list and print a boxed warning at configure time? Better 
>>> you could somehow generalize it and put it in package.py for use by all 
>>> packages, then any package can included lists of "suspect" versions. (There 
>>> are definitely HDF5 versions that should be avoided :-)).
>>> 
>>>     Barry
>>> 
>>> 
>>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com 
>>>> <mailto:junchao.zh...@gmail.com>> wrote:
>>>> 
>>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>>> --Junchao Zhang
>>>> 
>>>> 
>>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com 
>>>> <mailto:ch...@resfrac.com>> wrote:
>>>> Just as an update to this, I can confirm that using the mpich version 
>>>> (3.3.2) downloaded with the petsc download solved this issue on my end.
>>>> 
>>>> Chris Hewson
>>>> Senior Reservoir Simulation Engineer
>>>> ResFrac
>>>> +1.587.575.9792
>>>> 
>>>> 
>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com 
>>>> <mailto:junchao.zh...@gmail.com>> wrote:
>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev 
>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>> 
>>>>     Is there a comprehensive  MPI test suite (perhaps from MPICH)? Is 
>>>> there any way to run this full test suite under the problematic MPI and 
>>>> see if it detects any problems? 
>>>> 
>>>>      Is so, could someone add it to the FAQ in the debugging section?
>>>> MPICH does have a test suite. It is at the subdir test/mpi of downloaded 
>>>> mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>.  
>>>> It annoyed me since it is not user-friendly.  It might be helpful in 
>>>> catching bugs at very small scale. But say if I want to test allreduce on 
>>>> 1024 ranks on 100 doubles, I have to hack the test suite.
>>>> Anyway, the instructions are here.
>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>> $./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi 
>>>> --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast 
>>>> --disable-spawn --disable-cxx --disable-ft-tests  // It is weird I 
>>>> disabled cxx but I had to set CXX!
>>>> $make -k -j8  // -k is to keep going and ignore compilation errors, e.g., 
>>>> when building tests for MPICH extensions not in MPI standard, but your MPI 
>>>> is OpenMPI.
>>>> $ // edit testlist, remove lines mpi_t, rma, f77, impls. Those are 
>>>> sub-dirs containing tests for MPI routines Petsc does not rely on. 
>>>> $ make testings or directly './runtests -tests=testlist'
>>>> 
>>>> On a batch system, 
>>>> $export MPITEST_BATCHDIR=`pwd`/btest       // specify a batch dir, say 
>>>> btest, 
>>>> $./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist   // Use 1024 
>>>> ranks if a test does no specify the number of processes.
>>>> $ // It copies test binaries to the batch dir and generates a script 
>>>> runtests.batch there.  Edit the script to fit your batch system and then 
>>>> submit a job and wait for its finish. 
>>>> $ cd btest && ../checktests --ignorebogus
>>>> 
>>>> PS: Fande, changing an MPI fixed your problem does not necessarily mean 
>>>> the old MPI has bugs. It is complicated. It could be a petsc bug.  You 
>>>> need to provide us a code to reproduce your error. It does not matter if 
>>>> the code is big. 
>>>> 
>>>> 
>>>>     Thanks
>>>> 
>>>>       Barry
>>>> 
>>>> 
>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com 
>>>>> <mailto:fdkong...@gmail.com>> wrote:
>>>>> 
>>>>> Trace could look like this:
>>>>> 
>>>>> [640]PETSC ERROR: --------------------- Error Message 
>>>>> --------------------------------------------------------------
>>>>> [640]PETSC ERROR: Argument out of range
>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed 740521
>>>>> [640]PETSC ERROR: See 
>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html 
>>>>> <https://www.mcs.anl.gov/petsc/documentation/faq.html> for trouble 
>>>>> shooting.
>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown 
>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by 
>>>>> wangy2 Sun Jul 19 17:14:28 2020
>>>>> [640]PETSC ERROR: Configure options --download-hypre=1 
>>>>> --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1 
>>>>> --download-metis=1 --download-ptscotch=1 --download-parmetis=1 
>>>>> --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 
>>>>> --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 
>>>>> --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices 
>>>>> --download-mumps=0
>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line 901 
>>>>> in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line 3180 
>>>>> in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in 
>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>> 
>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com 
>>>>> <mailto:fdkong...@gmail.com>> wrote:
>>>>> I am not entirely sure what is happening, but we encountered similar 
>>>>> issues recently.  It was not reproducible. It might occur at different 
>>>>> stages, and errors could be weird other than "ctable stuff." Our code was 
>>>>> Valgrind clean since every PR in moose needs to go through rigorous 
>>>>> Valgrind checks before it reaches the devel branch.  The errors happened 
>>>>> when we used mvapich.
>>>>> 
>>>>> We changed to use HPE-MPT (a vendor stalled MPI), then everything was 
>>>>> smooth.  May you try a different MPI? It is better to try a system 
>>>>> carried one. 
>>>>> 
>>>>> We did not get the bottom of this problem yet, but we at least know this 
>>>>> is kind of MPI-related. 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Fande,
>>>>> 
>>>>> 
>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com 
>>>>> <mailto:ch...@resfrac.com>> wrote:
>>>>> Hi,
>>>>> 
>>>>> I am having a bug that is occurring in PETSC with the return string:
>>>>> 
>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in 
>>>>> /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than 
>>>>> largest key allowed 5693
>>>>> 
>>>>> This is using petsc-3.13.2, compiled and running using mpich with -O3 and 
>>>>> debugging turned off tuned to the haswell architecture and occurring 
>>>>> either before or during a KSPBCGS solve/setup or during a MUMPS 
>>>>> factorization solve (I haven't been able to replicate this issue with the 
>>>>> same set of instructions etc.).
>>>>> 
>>>>> This is a terrible way to ask a question, I know, and not very helpful 
>>>>> from your side, but this is what I have from a user's run and can't 
>>>>> reproduce on my end (either with the optimization compilation or with 
>>>>> debugging turned on). This happens when the code has run for quite some 
>>>>> time and is happening somewhat rarely.
>>>>> 
>>>>> More than likely I am using a static variable (code is written in c++) 
>>>>> that I'm not updating when the matrix size is changing or something silly 
>>>>> like that, but any help or guidance on this would be appreciated. 
>>>>> 
>>>>> Chris Hewson
>>>>> Senior Reservoir Simulation Engineer
>>>>> ResFrac
>>>>> +1.587.575.9792
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>

Re: [petsc-users] Tough to reproduce petsctablefind error

Reply via email to