Everyone,
Previously we checked the bounds range for the debug version of the code but not the optimized version. Based on Mark's experience I felt that the tiny hit on performance on checking was worth it all the time and our intention is now to always check these bounds. Barry > On Nov 3, 2020, at 7:30 AM, Matthew Knepley <knep...@gmail.com> wrote: > > On Tue, Nov 3, 2020 at 8:23 AM Mark McClure <m...@resfrac.com > <mailto:m...@resfrac.com>> wrote: > Hi, all. > > I am emailing to close the loop on this. > > There were two things combining to cause our issue. > > 1. At some point, several years ago, I had set > PetscPushErrorHandler(PetscAbortErrorHandler, NULL), and then forgotten about > it. This caused the program to terminate when an error was encountered. > > 2. I believe our code had a very rare, nonreproducible bug (occurring once > every 1000s of hours of runtime) where it passed column and/or row values to > Petsc that were greater than the size of the matrix. > > Having changed the error handler, and also put in a special error check to > check the column/row assignments and abort (and then rerun with smaller dt) a > timestep if they are out of range. Having done that, we've run for over a > month and have not seen the problem reproduce. > > Two ideas that might be helpful to avoid this problem in the future: > > 1. When Petsc aborts with error because > PetscPushErrorHandler(PetscAbortErrorHandler, NULL) is set, it might be > helpful to add to the error message that is printed to cout. "Petsc is > aborting the program because the Petsc Error Handler has been set to abort on > errors. This can be changed by modifying the option passed into the error > handler." Otherwise, it might be unclear to a user why Petsc is aborting (if > abort on error is set, but they don't realize). We had written error checks > to catch errors passed out of Petsc and handle gracefully. But had overlooked > the error handler setting, and so were confused why this error was causing > the entire program to terminate. A bit more explanation in the abort message > could help avoid that kind of user confusion. I had thought it was a hard > crash out of Petsc, since I was confused why the function wasn't returning. > > Hi Mark, > > That is a good suggestion. I am doing it. > > 2. When the "greater than largest key allowed" error is encountered, it might > be helpful to add to the warning message printed to users to additionally > say: "This error may occur if the column and row values passed into Petsc are > within the range from 0 to size of matrix." Not positive, but I am about 90% > sure that is why we were hitting the error. > > Can you help me understand this? It should be impossible to pass row or > column indices that are greater than the matrix size through MatSetValues: > > > https://gitlab.com/petsc/petsc/-/blob/master/src/mat/impls/aij/mpi/mpiaij.c#L582 > > <https://gitlab.com/petsc/petsc/-/blob/master/src/mat/impls/aij/mpi/mpiaij.c#L582> > > What are you calling to get them in? > > Thanks, > > Matt > > Thank you again for your help. I really appreciate your rapid responses and > help! > > Mark > > > > > > On Sat, Sep 26, 2020 at 10:53 PM Mark McClure <m...@resfrac.com > <mailto:m...@resfrac.com>> wrote: > Ok, I think we've made some progress. > > We were already calling the function like this: ierr = PetscCall(); if (ierr > != 0) {do something to handle error}. We actually are doing that on every > single call made to Petsc, just to be careful. This is what was confusing to > me. Why was the program terminating from within Petsc and not returning out > with an error? We'd written our code so that if Petsc did return with an > error, we'd discard the full timestep, destroy all the Petsc data structures > and redo everything with a smaller dt. So if Petsc did hit this error very > rarely, we might be able to recovery gracefully. > > It does not appear to be seg faulting. So it seemed that the program was > being terminated intentionally from within Petsc, which was puzzling, and why > I was asking about that in my previous email. > > So - Chris made a great find. Turns out that right after PetscInitialize in > our main.cpp, we had the line: > > PetscPushErrorHandler(PetscAbortErrorHandler, NULL); > > Which was telling Petsc to call MPI_Abort if there was an error. I probably > put that line into the code years ago and forgot it was there. So, as Barry > said, if we change the PetscErrorHandler option to ignore, then at least we > can avoid the program aborting on the error, and hopefully be able to recover > with our existing code logic. > > Also, may have found a clue on the root cause of the error. I had thought > that were were checking all of our inputs to Petsc for issues such as out of > range index values. But I went back and see that due to a versioning mistake, > there is one particular error check on our inputs that was being removed from > production builds by a preprocessor definition. Which means that it wouldn't > be caught in our production builds, which means that it is possible that bad > inputs could have been passed into Petsc. I don't know for sure - but is > plausible. The missing error check was doing the following: checking to see > if the fourth entry to MatSetValues ("n", for the number of nonzero values in > the row) > (https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html > > <https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html>) > was equal to the sum of the number of diagonal and off diagonal values that > we had specified in our previous call to MatMPIAIJSetPreallocation. > > So that is at least a theory for what was happening. The theory would be: > very rarely, due to a bug in our code, we were running MatSetValues with "n" > set to a value not equal to the number of nonzero values promised in the call > to MatMPIAIJSetPreallocation. Maybe this led to the Petsc "key 7556 is > greater than largest key allowed 5693" error message, and then our setting of > 'PetscAbortErrorHandler' was causing the program to abort. > > Mark > > > > > > > > On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <bsm...@petsc.dev > <mailto:bsm...@petsc.dev>> wrote: > > >> On Sep 26, 2020, at 5:58 PM, Junchao Zhang <junchao.zh...@gmail.com >> <mailto:junchao.zh...@gmail.com>> wrote: >> >> >> >> On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <mfad...@lbl.gov >> <mailto:mfad...@lbl.gov>> wrote: >> >> >> On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <knep...@gmail.com >> <mailto:knep...@gmail.com>> wrote: >> On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <m...@resfrac.com >> <mailto:m...@resfrac.com>> wrote: >> Thank you, all for the explanations. >> >> Following Matt's suggestion, we'll use -g (and not use -with-debugging=0) >> all future compiles to all users, so in future, we can provide better >> information. >> >> Second, Chris is going to boil our function down to minimum stub and share >> in case there is some subtle issue with the way the functions are being >> called. >> >> Third, I have question/request - Petsc is, in fact, detecting an error. As >> far as I can tell, this is not an uncontrolled 'seg fault'. It seems to me >> that maybe Petsc could choose to return out from the function when it >> detects this error, returning an error code, rather than dumping the core >> and terminating the program. If Petsc simply returned out with an error >> message, this would resolve the problem for us. After the Petsc call, we >> check for Petsc error messages. If Petsc returns an error - that's fine - we >> use a direct solver as a backup, and the simulation continues. So - I am not >> sure whether this is feasible - but if Petsc could return out with an error >> message - rather than dumping the core and terminating the program - then >> that would effectively resolve the issue for us. Would this change be >> possible? >> >> At some level, I think it is currently doing what you want. CHKERRQ() simply >> returns an error code from that function call, printing an error message. >> Suppressing the message is harder I think, >> >> He does not need this. >> >> but for now, if you know what function call is causing the error, you can >> just catch the (ierr != 0) yourself instead of using CHKERRQ. >> >> This is what I suggested earlier but maybe I was not clear enough. >> >> Your code calls something like >> >> ierr = SNESSolve(....); CHKERRQ(ierr); >> >> You can replace this with: >> >> ierr = SNESSolve(....); >> if (ierr) { >> How to deal with CHKERRQ(ierr); inside SNESSolve()? > > > PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL); > > But the problem in this code's runs appear to be due to corrupt data, why > and how it gets corrupted is not known. Continuing with an alternative solver > because a solver failed for numerical or algorithmic reasons is generally > fine but continuing when there is corrupted data is always iffy because one > doesn't know how far the corruption has spread. SNESDestroy(&snes); > SNESCreate(&snes); may likely clean out any potentially corrupted data but if > the corruption got into the mesh data structures it will still be there. > > A very sophisticated library code would, when it detects this type of > corruption, sweep through all the data structures looking for any indications > of corruption, to help determine the cause and/or even fix the problem. We > don't have this code in place, though we could add some, because generally we > relay on valgrind or -malloc_debug to detect such corruption, the problem is > valgrind and -malloc_debug don't fit well in a production environment. > Handling corruption that comes up in production but not testing is difficult. > > Barry > > > > >> .... >> } >> >> I suggested something earlier to do here. Maybe call KSPView. You could even >> destroy the solver and start the solver from scratch and see if that works. >> >> Mark >> >> The drawback here is that we might not have cleaned up >> all the state so that restarting makes sense. It should be possible to just >> kill the solve, reset the solver, and retry, although it is not clear to me >> at first glance if MPI will be in an okay state. >> > > > > -- > What most experimenters take for granted before they begin their experiments > is infinitely more interesting than any results to which their experiments > lead. > -- Norbert Wiener > > https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>