On Fri, Sep 25, 2020 at 1:29 PM Mark McClure <m...@resfrac.com> wrote:
> Hello,
>
> I work with Chris, and have a few comments, hopefully helpful. Thank you all for your help.
>
> Our program is unfortunately behaving a little bit nondeterministically. I am not sure why, because for the OpenMP parts I test it for race conditions using Intel Inspector and see none. We are using Metis, not Parmetis. Do Petsc or Metis have any behavior that is nondeterministic? We will continue to investigate the cause of the nondeterminism. But that is a challenge for us in reproducing this problem, because we can run the same simulation 10 times and not get precisely the same result each time. In fact, for this bug, Chris did run it 10 times and did not reproduce.
>
> Every month, 1000s of hours of this simulator are being run on the Azure cluster. This crash has been occurring for months, but at infrequent intervals, and we have never been able to reproduce it. As such, we can't generate an error dump, unless we gave all users a Petsc build with no optimization and waited weeks until a problem cropped up.
>
> Here's what we'll do - we'll internally make a build with debugging and then run a large number of simulations until the problem reproduces and we get the error dump. That will take us some time, but we'll do it.
>
> Here is a bit more background that might be helpful. At first, we used OpenMPI 2.1.1-8 with Petsc 3.13.2. With that combo, we observed a memory leak, and simulations failed when the node ran out of RAM. Then we switched to MPICH 3.3a2 and Petsc 3.13.3. The memory leak went away. That's when we started seeing this bug, "greater than largest key allowed". It was unreproducible, but happening relatively more often than it is now (I think) - I was getting a couple of user reports a week. Then we updated to MPICH 3.3.2 and the same Petsc version (3.13.3). The problem seems to be less common - it hadn't happened for the past month. But then it happened four times this week.
>
> Other background to note - our linear system very frequently changes size/shape. There will be 10,000s of Petsc solves with different matrices (different positions of nonzero values) over the course of a simulation. As far as we can tell, the crash occurs only after the simulator has run for a long time, 10+ hours. Having said that, it does not appear to be a memory leak - at least, the node has plenty of remaining RAM when these crashes are occurring.

If it helps, I believe you can configure an optimized build to add -g, COPTFLAGS="-g" CXXOPTFLAGS="-g", so that symbols are kept in the optimized libraries. That way we should be able to get a stack trace on a run that performs well.

  Thanks,

     Matt

> Mark
>
> On Thu, Sep 24, 2020 at 4:41 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> You might add code here like:
>>
>>   if (ierr) {  /* the table call above failed: dump the offending indices */
>>     for (i=0; i<aij->B->rmap->n; i++) {
>>       for (j=0; j<B->ilen[i]; j++) {
>>         PetscInt gid1 = aj[B->i[i] + j] + 1;  /* don't overwrite ierr here */
>>         PetscPrintf(PETSC_COMM_SELF,"[%d] gid1 %D\n",rank,gid1);
>>       }
>>     }
>>   }
>>   CHKERRQ(ierr);
>>
>> I am guessing that somehow you have a table that is bad and too small. It failed on an index not much bigger than the largest key allowed. Maybe just compute the max and see if it goes much larger than the largest key allowed.
>>
>> If your mesh just changed, do you know if it got bigger or smaller...
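A minimal sketch of the "compute the max" check suggested above: it assumes the same local variables as the MatSetUpMultiply_MPIAIJ loop Barry quotes later in the thread (aij, B, aj, mat) plus an MPI rank, and it is an illustration rather than PETSc source.

    PetscInt    i, j, maxgid1 = 0;
    PetscMPIInt rank;

    MPI_Comm_rank(PetscObjectComm((PetscObject)mat), &rank);
    for (i = 0; i < aij->B->rmap->n; i++) {
      for (j = 0; j < B->ilen[i]; j++) {
        PetscInt gid1 = aj[B->i[i] + j] + 1;   /* one-based global column id */
        if (gid1 > maxgid1) maxgid1 = gid1;
      }
    }
    /* column entries should always be less than the number of columns mat->cmap->N */
    if (maxgid1 > mat->cmap->N) {
      PetscPrintf(PETSC_COMM_SELF, "[%d] max gid1 %D exceeds number of columns %D\n",
                  rank, maxgid1, mat->cmap->N);
    }

Whether the maximum lands just above or far beyond mat->cmap->N is exactly the distinction Mark is suggesting to look at.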
>>
>> Anyway, just some thoughts,
>> Mark
>>
>>
>> On Thu, Sep 24, 2020 at 4:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>
>>> On Sep 24, 2020, at 2:47 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Thu, Sep 24, 2020 at 3:42 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>> The stack is listed below. It crashes inside MatPtAP().
>>>>
>>>
>>> What about just checking that the column indices that PtAP receives are valid? Are we not doing that?
>>>
>>>
>>> The code that checks for a column index that is too large in MatSetValues_MPIXXAIJ() is turned off for optimized builds; I am making an MR to always have it on. But I doubt this is the problem - other, harsher crashes are likely if the column index is too large.
>>>
>>> This is difficult to debug because all we get is a stack trace. It would be good if we produced some information about the current state of the objects when the error is detected. We should think about what light-weight stuff we could report when errors are detected.
>>>
>>> Barry
>>>
>>>
>>> Matt
>>>
>>>
>>>> It is possible there is some subtle bug in the rather complex PETSc code for MatPtAP(), but I am inclined to blame MPI again.
>>>>
>>>> I think we should add some simple, low-overhead, always-on communication error detecting code to PetscSF, where some checksums are also communicated at the highest level of PetscSF().
>>>>
>>>> I don't know how, but perhaps when the data is packed per destination rank a checksum is computed, and when it is unpacked the checksum is compared, using extra space at the end of the communicated packed array to store and send the checksum. Yes, it is kind of odd for a high-level library like PETSc not to trust the communication channel, but MPI implementations have proven themselves to not be trustworthy, and adding this to PetscSF is not intrusive to the PETSc API or user. Plus it gives a definitive yes or no as to whether the problem comes from an error in the communication.
>>>>
>>>> Barry
>>>>
>>>> On Sep 24, 2020, at 12:35 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>
>>>> On Thu, Sep 24, 2020 at 1:22 PM Chris Hewson <ch...@resfrac.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> Thanks for all of the prompt responses, very helpful and appreciated.
>>>>>
>>>>> By "when debugging", did you mean when you configure petsc with --with-debugging=1 COPTFLAGS=-O0 -g etc., or when you attached a debugger?
>>>>> - Both. I have run with a debugger attached and detached, all compiled with the following flags: "--with-debugging=1 COPTFLAGS=-O0 -g"
>>>>>
>>>>> 1) Try OpenMPI (probably won't help, but worth trying)
>>>>> - Worth a try for sure
>>>>>
>>>>> 2) Find which part of the simulation makes it non-deterministic. Is it the mesh partitioning (parmetis)? Then try to make it deterministic.
>>>>> - Good tip. It is the mesh partitioning, and along the lines of a question from Barry, the matrix size is changing. I will make this deterministic and give it a try.
>>>>>
>>>>> 3) Dump matrices, vectors, etc. and see when it fails; you can quickly reproduce the error by reading in the intermediate data.
>>>>> - Also a great suggestion, will give it a try
>>>>>
>>>>> The full stack would be really useful here. I am guessing this happens on MatMult(), but I do not know.
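Suggestion 3) above can be wired in with just the binary viewer; here is a sketch of what that might look like. The helper name, the file naming scheme, and the idea of keying files by a step counter are assumptions for illustration, not anything prescribed in the thread.

    #include <petscmat.h>

    /* Dump the current system so a failing step can be replayed offline
       with MatLoad()/VecLoad() in a small standalone driver. */
    static PetscErrorCode DumpSystem(Mat A, Vec b, PetscInt step)
    {
      PetscErrorCode ierr;
      PetscViewer    viewer;
      char           fname[PETSC_MAX_PATH_LEN];

      PetscFunctionBegin;
      ierr = PetscSNPrintf(fname, sizeof(fname), "system_step_%D.bin", step);CHKERRQ(ierr);
      ierr = PetscViewerBinaryOpen(PetscObjectComm((PetscObject)A), fname, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
      ierr = MatView(A, viewer);CHKERRQ(ierr);   /* matrix first ...             */
      ierr = VecView(b, viewer);CHKERRQ(ierr);   /* ... then the right-hand side */
      ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Calling something like this just before each solve (and keeping only the last few files) gives the intermediate data needed to reproduce a failure without rerunning 10+ hours of simulation.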
>>>>> - Agreed. I am currently running it so that the full stack will be produced, but I am waiting for it to fail; it had been compiled with all available optimizations on, but the downside of course comes if there is a failure.
>>>>> As a general question, roughly what's the performance impact on petsc with -O1/-O2/-O0 as opposed to -O3? And the performance impact of --with-debugging=1? Obviously problem/machine dependent; wondering about guidance more for this than anything.
>>>>>
>>>>> Is the nonzero structure of your matrices changing or is it fixed for the entire simulation?
>>>>> The non-zero structure is changing, although the structures are reformed when this happens, and this happens thousands of times before the failure has occurred.
>>>>>
>>>>
>>>> Okay, this is the most likely spot for a bug. How are you changing the matrix? It should be impossible to put in an invalid column index when using MatSetValues() because we check all the inputs. However, I do not think we check when you just yank out the arrays.
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Does this particular run always crash at the same place? Similar place? Doesn't always crash?
>>>>> Doesn't always crash, but other similar runs have crashed in different spots, which makes it difficult to resolve. I am going to try out a few of the strategies suggested above and will let you know what comes of that.
>>>>>
>>>>> *Chris Hewson*
>>>>> Senior Reservoir Simulation Engineer
>>>>> ResFrac
>>>>> +1.587.575.9792
>>>>>
>>>>>
>>>>> On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>
>>>>>> Chris,
>>>>>>
>>>>>> We realize how frustrating this type of problem is to deal with.
>>>>>>
>>>>>> Here is the code:
>>>>>>
>>>>>>   ierr = PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>>>>>>   for (i=0; i<aij->B->rmap->n; i++) {
>>>>>>     for (j=0; j<B->ilen[i]; j++) {
>>>>>>       PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>>>>>>       ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>>>>>>       if (!data) {
>>>>>>         /* one based table */
>>>>>>         ierr = PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> It is simply looping over the rows of the sparse matrix, putting the columns it finds into the hash table.
>>>>>>
>>>>>> aj[B->i[i] + j] are the column entries, and the number of columns in the matrix is mat->cmap->N, so the column entries should always be less than the number of columns. The code is crashing on column entry 24443, which is larger than the number of columns, 23988.
>>>>>>
>>>>>> So either the aj[B->i[i] + j] + 1 are incorrect or the mat->cmap->N is incorrect.
>>>>>>
>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>
>>>>>>   if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>>>>>>     ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>>>>>>   }
>>>>>>
>>>>>> Seems to indicate it is setting up a new multiply because it is either the first time into the algorithm or the nonzero structure changed on some rank, requiring a new assembly process.
>>>>>>
>>>>>> Is the nonzero structure of your matrices changing or is it fixed for the entire simulation?
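Since the column entries are the prime suspect here, one cheap thing the application can do, independent of the check PETSc itself performs, is validate its own column indices before they ever reach the matrix. A sketch, with the helper name and arguments being illustrative assumptions:

    #include <petscmat.h>

    /* Guard insertion: verify every global column index is inside [0, N)
       before handing the row to MatSetValues(). */
    static PetscErrorCode CheckAndSetRow(Mat A, PetscInt row, PetscInt ncols,
                                         const PetscInt cols[], const PetscScalar vals[])
    {
      PetscErrorCode ierr;
      PetscInt       j, N;

      PetscFunctionBegin;
      ierr = MatGetSize(A, NULL, &N);CHKERRQ(ierr);   /* global number of columns */
      for (j = 0; j < ncols; j++) {
        if (cols[j] < 0 || cols[j] >= N) {
          SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
                   "Row %D: column index %D outside [0,%D)", row, cols[j], N);
        }
      }
      ierr = MatSetValues(A, 1, &row, ncols, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

If the simulator instead builds raw CSR arrays and hands them to the Mat directly (the "yank out the arrays" case mentioned above), the same loop can be run over those arrays right before they are passed in.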
>>>>>> >>>>>> Since the code has been running for a very long time already I have >>>>>> to conclude that this is not the first time through and so something has >>>>>> changed in the matrix? >>>>>> >>>>>> I think we have to put more diagnostics into the library to provide >>>>>> more information before or at the time of the error detection. >>>>>> >>>>>> Does this particular run always crash at the same place? Similar >>>>>> place? Doesn't always crash? >>>>>> >>>>>> Barry >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Sep 24, 2020, at 8:46 AM, Chris Hewson <ch...@resfrac.com> wrote: >>>>>> >>>>>> After about a month of not having this issue pop up, it has come up >>>>>> again >>>>>> >>>>>> We have been struggling with a similar PETSc Error for awhile now, >>>>>> the error message is as follows: >>>>>> >>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in >>>>>> /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater >>>>>> than >>>>>> largest key allowed 23988 >>>>>> >>>>>> It is a particularly nasty bug as it doesn't reproduce itself when >>>>>> debugging and doesn't happen all the time with the same inputs either. >>>>>> The >>>>>> problem occurs after a long runtime of the code (12-40 hours) and we are >>>>>> using a ksp solve with KSPBCGS. >>>>>> >>>>>> The PETSc compilation options that are used are: >>>>>> >>>>>> --download-metis >>>>>> --download-mpich >>>>>> --download-mumps >>>>>> --download-parmetis >>>>>> --download-ptscotch >>>>>> --download-scalapack >>>>>> --download-suitesparse >>>>>> --prefix=/opt/anl/petsc-3.13.3 >>>>>> --with-debugging=0 >>>>>> --with-mpi=1 >>>>>> COPTFLAGS=-O3 -march=haswell -mtune=haswell >>>>>> CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell >>>>>> FOPTFLAGS=-O3 -march=haswell -mtune=haswell >>>>>> >>>>>> This is being run across 8 processes and is failing consistently on >>>>>> the rank 7 process. We also use openmp outside of PETSC and the linear >>>>>> solve portion of the code. The rank 0 process is always using compute, >>>>>> during this the slave processes use an MPI_Wait call to wait for the >>>>>> collective parts of the code to be called again. All PETSC calls are done >>>>>> across all of the processes. >>>>>> >>>>>> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3 >>>>>> package. >>>>>> >>>>>> At every PETSC call we are checking the error return from the >>>>>> function collectively to ensure that no errors have been returned from >>>>>> PETSC. >>>>>> >>>>>> Some possible causes that I can think of are as follows: >>>>>> 1. Memory leak causing a corruption either in our program or in petsc >>>>>> or with one of the petsc objects. This seems unlikely as we have checked >>>>>> runs with the option -malloc_dump for PETSc and using valgrind. >>>>>> >>>>>> 2. Optimization flags set for petsc compilation are causing variables >>>>>> that go out of scope to be optimized out. >>>>>> >>>>>> 3. We are giving the wrong number of elements for a process or the >>>>>> value is changing during the simulation. This seems unlikely as there is >>>>>> nothing overly unique about these simulations and it's not reproducing >>>>>> itself. >>>>>> >>>>>> 4. An MPI channel or socket error causing an error in the collective >>>>>> values for PETSc. >>>>>> >>>>>> Any input on this issue would be greatly appreciated. 
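On possible cause 4 above, the trailing-checksum guard Barry described earlier in the thread can be prototyped in a few lines outside of PETSc. The toy below only illustrates the idea (the hash, buffer sizes, and ring exchange are all made up for the example); it is not PetscSF code.

    #include <mpi.h>
    #include <stdio.h>

    /* Toy trailing-checksum exchange: pack data, append a checksum,
       and verify it after the receive so corrupted messages are caught. */
    static unsigned long Checksum(const long *buf, int n)
    {
      unsigned long s = 0;
      for (int i = 0; i < n; i++) s = 31u*s + (unsigned long)buf[i];
      return s;
    }

    int main(int argc, char **argv)
    {
      int  rank, size, n = 4;
      long sendbuf[5], recvbuf[5];          /* n payload entries + 1 checksum slot */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (int i = 0; i < n; i++) sendbuf[i] = 100*rank + i;   /* "pack" */
      sendbuf[n] = (long)Checksum(sendbuf, n);                 /* append checksum */

      MPI_Sendrecv(sendbuf, n + 1, MPI_LONG, (rank + 1) % size, 0,
                   recvbuf, n + 1, MPI_LONG, (rank - 1 + size) % size, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      if ((long)Checksum(recvbuf, n) != recvbuf[n]) {          /* verify on "unpack" */
        fprintf(stderr, "[%d] checksum mismatch: message corrupted in transit\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
      }
      MPI_Finalize();
      return 0;
    }

Run for a long time under the suspect MPI, or wrapped around the application's own exchange pattern, a mismatch here would point squarely at the communication layer; silence, of course, proves nothing.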
>>>>>>
>>>>>> *Chris Hewson*
>>>>>> Senior Reservoir Simulation Engineer
>>>>>> ResFrac
>>>>>> +1.587.575.9792
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>
>>>>>>> That is a great idea. I'll figure it out.
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>
>>>>>>>> Junchao,
>>>>>>>>
>>>>>>>> Is there any way in the PETSc configure to warn that an MPICH version is "bad" or "untrustworthy", or even the vague "we have suspicions about this version and recommend avoiding it"? A lot of time could be saved if others didn't have to deal with the same problem.
>>>>>>>>
>>>>>>>> Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc. and always check against that list and print a boxed warning at configure time? Better, you could somehow generalize it and put it in package.py for use by all packages; then any package can include lists of "suspect" versions. (There are definitely HDF5 versions that should be avoided :-)).
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>>>>
>>>>>>>>> Just as an update to this, I can confirm that using the mpich version (3.3.2) downloaded with the petsc download solved this issue on my end.
>>>>>>>>>
>>>>>>>>> *Chris Hewson*
>>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>>> ResFrac
>>>>>>>>> +1.587.575.9792
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is there a comprehensive MPI test suite (perhaps from MPICH)? Is there any way to run this full test suite under the problematic MPI and see if it detects any problems?
>>>>>>>>>>>
>>>>>>>>>>> If so, could someone add it to the FAQ in the debugging section?
>>>>>>>>>>>
>>>>>>>>>> MPICH does have a test suite. It is in the subdir test/mpi of the downloaded mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It annoyed me since it is not user-friendly. It might be helpful in catching bugs at very small scale. But say I want to test allreduce on 1024 ranks on 100 doubles; then I have to hack the test suite. Anyway, the instructions are here.
>>>>>>>>>>
>>>>>>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>>>>>>> $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast --disable-spawn --disable-cxx --disable-ft-tests   // It is weird that I disabled cxx but I had to set CXX!
>>>>>>>>>> $ make -k -j8   // -k is to keep going and ignore compilation errors, e.g., when building tests for MPICH extensions not in the MPI standard, but your MPI is OpenMPI.
>>>>>>>>>> $ // edit testlist, remove lines mpi_t, rma, f77, impls.
Those >>>>>>>>>> are sub-dirs containing tests for MPI routines Petsc does not rely >>>>>>>>>> on. >>>>>>>>>> $ make testings or directly './runtests -tests=testlist' >>>>>>>>>> >>>>>>>>>> On a batch system, >>>>>>>>>> $export MPITEST_BATCHDIR=`pwd`/btest // specify a batch >>>>>>>>>> dir, say btest, >>>>>>>>>> $./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist // >>>>>>>>>> Use 1024 ranks if a test does no specify the number of processes. >>>>>>>>>> $ // It copies test binaries to the batch dir and generates a >>>>>>>>>> script runtests.batch there. Edit the script to fit your batch >>>>>>>>>> system and >>>>>>>>>> then submit a job and wait for its finish. >>>>>>>>>> $ cd btest && ../checktests --ignorebogus >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> PS: Fande, changing an MPI fixed your problem does not >>>>>>>>>> necessarily mean the old MPI has bugs. It is complicated. It could >>>>>>>>>> be a >>>>>>>>>> petsc bug. You need to provide us a code to reproduce your error. >>>>>>>>>> It does >>>>>>>>>> not matter if the code is big. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> >>>>>>>>>>> Barry >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Trace could look like this: >>>>>>>>>>> >>>>>>>>>>> [640]PETSC ERROR: --------------------- Error Message >>>>>>>>>>> -------------------------------------------------------------- >>>>>>>>>>> [640]PETSC ERROR: Argument out of range >>>>>>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key >>>>>>>>>>> allowed 740521 >>>>>>>>>>> [640]PETSC ERROR: See >>>>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for >>>>>>>>>>> trouble shooting. >>>>>>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown >>>>>>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named >>>>>>>>>>> r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020 >>>>>>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1 >>>>>>>>>>> --with-debugging=no --with-shared-libraries=1 >>>>>>>>>>> --download-fblaslapack=1 >>>>>>>>>>> --download-metis=1 --download-ptscotch=1 --download-parmetis=1 >>>>>>>>>>> --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 >>>>>>>>>>> --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 >>>>>>>>>>> --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices >>>>>>>>>>> --download-mumps=0 >>>>>>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h >>>>>>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c >>>>>>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c >>>>>>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c >>>>>>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() >>>>>>>>>>> line 901 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c >>>>>>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() >>>>>>>>>>> line 3180 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c >>>>>>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in >>>>>>>>>>> 
/home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c >>>>>>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c >>>>>>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c >>>>>>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c >>>>>>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c >>>>>>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c >>>>>>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c >>>>>>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c >>>>>>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c >>>>>>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c >>>>>>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c >>>>>>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in >>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c >>>>>>>>>>> >>>>>>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I am not entirely sure what is happening, but we encountered >>>>>>>>>>>> similar issues recently. It was not reproducible. It might occur >>>>>>>>>>>> at >>>>>>>>>>>> different stages, and errors could be weird other than "ctable >>>>>>>>>>>> stuff." Our >>>>>>>>>>>> code was Valgrind clean since every PR in moose needs to go through >>>>>>>>>>>> rigorous Valgrind checks before it reaches the devel branch. The >>>>>>>>>>>> errors >>>>>>>>>>>> happened when we used mvapich. >>>>>>>>>>>> >>>>>>>>>>>> We changed to use HPE-MPT (a vendor stalled MPI), then >>>>>>>>>>>> everything was smooth. May you try a different MPI? It is better >>>>>>>>>>>> to try a >>>>>>>>>>>> system carried one. >>>>>>>>>>>> >>>>>>>>>>>> We did not get the bottom of this problem yet, but we at least >>>>>>>>>>>> know this is kind of MPI-related. 
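Since so much of this thread turns on which MPI implementation and version a failing run was actually using, it may be worth logging that string once at startup so intermittent crash reports can be correlated with it. A small sketch; the helper name is an assumption, while MPI_Get_library_version itself is standard MPI-3.

    #include <petscsys.h>

    /* Print the MPI library identification string once, e.g. right after
       PetscInitialize(), so crash reports carry the MPI version with them. */
    static PetscErrorCode LogMPILibraryVersion(void)
    {
      PetscErrorCode ierr;
      char           version[MPI_MAX_LIBRARY_VERSION_STRING];
      int            len;

      PetscFunctionBegin;
      MPI_Get_library_version(version, &len);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "MPI library: %s\n", version);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }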
>>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Fande, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> I am having a bug that is occurring in PETSC with the return >>>>>>>>>>>>> string: >>>>>>>>>>>>> >>>>>>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in >>>>>>>>>>>>> /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is >>>>>>>>>>>>> greater than >>>>>>>>>>>>> largest key allowed 5693 >>>>>>>>>>>>> >>>>>>>>>>>>> This is using petsc-3.13.2, compiled and running using mpich >>>>>>>>>>>>> with -O3 and debugging turned off tuned to the haswell >>>>>>>>>>>>> architecture and >>>>>>>>>>>>> occurring either before or during a KSPBCGS solve/setup or during >>>>>>>>>>>>> a MUMPS >>>>>>>>>>>>> factorization solve (I haven't been able to replicate this issue >>>>>>>>>>>>> with the >>>>>>>>>>>>> same set of instructions etc.). >>>>>>>>>>>>> >>>>>>>>>>>>> This is a terrible way to ask a question, I know, and not very >>>>>>>>>>>>> helpful from your side, but this is what I have from a user's run >>>>>>>>>>>>> and can't >>>>>>>>>>>>> reproduce on my end (either with the optimization compilation or >>>>>>>>>>>>> with >>>>>>>>>>>>> debugging turned on). This happens when the code has run for >>>>>>>>>>>>> quite some >>>>>>>>>>>>> time and is happening somewhat rarely. >>>>>>>>>>>>> >>>>>>>>>>>>> More than likely I am using a static variable (code is written >>>>>>>>>>>>> in c++) that I'm not updating when the matrix size is changing or >>>>>>>>>>>>> something >>>>>>>>>>>>> silly like that, but any help or guidance on this would be >>>>>>>>>>>>> appreciated. >>>>>>>>>>>>> >>>>>>>>>>>>> *Chris Hewson* >>>>>>>>>>>>> Senior Reservoir Simulation Engineer >>>>>>>>>>>>> ResFrac >>>>>>>>>>>>> +1.587.575.9792 >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>> >>>>>> >>>> >>>> -- >>>> What most experimenters take for granted before they begin their >>>> experiments is infinitely more interesting than any results to which their >>>> experiments lead. >>>> -- Norbert Wiener >>>> >>>> https://www.cse.buffalo.edu/~knepley/ >>>> <http://www.cse.buffalo.edu/~knepley/> >>>> >>>> >>>> >>> >>> -- >>> What most experimenters take for granted before they begin their >>> experiments is infinitely more interesting than any results to which their >>> experiments lead. >>> -- Norbert Wiener >>> >>> https://www.cse.buffalo.edu/~knepley/ >>> <http://www.cse.buffalo.edu/~knepley/> >>> >>> >>> -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>