Sorry, meant to say "are *not* within the range from 0 to size of matrix".
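For context, here is a minimal sketch of the kind of range check being discussed: rejecting row/column indices that are not within 0 to the size of the matrix before they reach MatSetValues. The helper name CheckMatIndices and the error handling are illustrative assumptions, not code from this thread.

#include <petscmat.h>

/* Hypothetical guard: confirm every row/column index lies in
   [0, M) x [0, N) before calling MatSetValues(). PETSc itself skips
   negative indices by convention, so in practice the dangerous case
   is an index at or beyond the matrix dimension. */
static PetscErrorCode CheckMatIndices(Mat A, PetscInt nrows, const PetscInt rows[],
                                      PetscInt ncols, const PetscInt cols[])
{
  PetscInt       M, N, i;
  PetscErrorCode ierr;

  ierr = MatGetSize(A, &M, &N); CHKERRQ(ierr);
  for (i = 0; i < nrows; i++) {
    if (rows[i] < 0 || rows[i] >= M)
      SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE, "Row index out of range");
  }
  for (i = 0; i < ncols; i++) {
    if (cols[i] < 0 || cols[i] >= N)
      SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE, "Column index out of range");
  }
  return 0;
}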
On Tue, Nov 3, 2020 at 8:22 AM Mark McClure <m...@resfrac.com> wrote:

> Hi, all.
>
> I am emailing to close the loop on this.
>
> There were two things combining to cause our issue.
>
> 1. At some point, several years ago, I had set
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL), and then forgotten
> about it. This caused the program to terminate when an error was
> encountered.
>
> 2. I believe our code had a very rare, nonreproducible bug (occurring once
> every 1000s of hours of runtime) where it passed column and/or row values
> to Petsc that were greater than the size of the matrix.
>
> We changed the error handler, and we also put in a special error check on
> the column/row assignments that aborts the timestep (and then reruns it
> with a smaller dt) if they are out of range. Having done that, we've run
> for over a month and have not seen the problem reproduce.
>
> Two ideas that might help avoid this problem in the future:
>
> 1. When Petsc aborts with an error because
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL) is set, it might be
> helpful to add to the error message that is printed to cout: "Petsc is
> aborting the program because the Petsc Error Handler has been set to abort
> on errors. This can be changed by modifying the option passed into the
> error handler." Otherwise, it might be unclear to a user why Petsc is
> aborting (if abort-on-error is set but they don't realize it). We had
> written error checks to catch errors passed out of Petsc and handle them
> gracefully, but we had overlooked the error handler setting, and so were
> confused about why this error was causing the entire program to terminate.
> A bit more explanation in the abort message could help avoid that kind of
> user confusion. I had thought it was a hard crash out of Petsc, since I
> was confused about why the function wasn't returning.
>
> 2. When the "greater than largest key allowed" error is encountered, it
> might be helpful to extend the warning message printed to users to
> additionally say: "This error may occur if the column and row values
> passed into Petsc are within the range from 0 to size of matrix." I am not
> positive, but I am about 90% sure that is why we were hitting the error.
>
> Thank you again for your help. I really appreciate your rapid responses
> and help!
>
> Mark
>
> On Sat, Sep 26, 2020 at 10:53 PM Mark McClure <m...@resfrac.com> wrote:
>
>> Ok, I think we've made some progress.
>>
>> We were already calling the function like this: ierr = PetscCall(); if
>> (ierr != 0) {do something to handle the error}. We actually do that on
>> every single call made to Petsc, just to be careful. This is what was
>> confusing to me. Why was the program terminating from within Petsc and
>> not returning out with an error? We'd written our code so that if Petsc
>> did return with an error, we'd discard the full timestep, destroy all the
>> Petsc data structures, and redo everything with a smaller dt. So if Petsc
>> did hit this error very rarely, we might be able to recover gracefully.
>>
>> It does not appear to be seg faulting. So it seemed that the program was
>> being terminated intentionally from within Petsc, which was puzzling, and
>> why I was asking about that in my previous email.
>>
>> So - Chris made a great find. It turns out that right after
>> PetscInitialize in our main.cpp, we had the line:
>>
>> PetscPushErrorHandler(PetscAbortErrorHandler, NULL);
>>
>> which was telling Petsc to call MPI_Abort if there was an error.
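To make the effect of that line concrete, here is a minimal sketch of how the handler push works. PetscPushErrorHandler, PetscPopErrorHandler, PetscAbortErrorHandler, and CHKERRQ are real PETSc symbols; the surrounding program is an illustrative assumption. The quoted email continues below.

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* The line found in main.cpp: from here on, any PETSc error triggers
     MPI_Abort() instead of returning an error code to the caller. */
  ierr = PetscPushErrorHandler(PetscAbortErrorHandler, NULL); CHKERRQ(ierr);

  /* ... any failing PETSc call in this region aborts the whole program ... */

  /* Popping the handler restores the previous behavior, under which
     errors propagate back up the stack as nonzero return codes. */
  ierr = PetscPopErrorHandler(); CHKERRQ(ierr);

  ierr = PetscFinalize();
  return ierr;
}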
>> I probably put that line into the code years ago and forgot it was
>> there. So, as Barry said, if we change the PetscErrorHandler option to
>> ignore, then at least we can avoid the program aborting on the error, and
>> hopefully be able to recover with our existing code logic.
>>
>> Also, we may have found a clue about the root cause of the error. I had
>> thought that we were checking all of our inputs to Petsc for issues such
>> as out-of-range index values. But I went back and saw that, due to a
>> versioning mistake, one particular error check on our inputs was being
>> removed from production builds by a preprocessor definition. That means
>> the error wouldn't be caught in our production builds, which means it is
>> possible that bad inputs could have been passed into Petsc. I don't know
>> for sure, but it is plausible. The missing error check was doing the
>> following: checking whether the fourth argument to MatSetValues ("n", the
>> number of nonzero values in the row) (
>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html)
>> was equal to the sum of the number of diagonal and off-diagonal values
>> that we had specified in our previous call to MatMPIAIJSetPreallocation.
>>
>> So that is at least a theory for what was happening. The theory would
>> be: very rarely, due to a bug in our code, we were running MatSetValues
>> with "n" set to a value not equal to the number of nonzero values
>> promised in the call to MatMPIAIJSetPreallocation. Maybe this led to the
>> Petsc "key 7556 is greater than largest key allowed 5693" error message,
>> and then our setting of 'PetscAbortErrorHandler' was causing the program
>> to abort.
>>
>> Mark
>>
>> On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> On Sep 26, 2020, at 5:58 PM, Junchao Zhang <junchao.zh...@gmail.com>
>>> wrote:
>>>
>>> On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <knep...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <m...@resfrac.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you, all, for the explanations.
>>>>>>
>>>>>> Following Matt's suggestion, we'll use -g (and not use
>>>>>> -with-debugging=0) for all future compiles for all users, so in the
>>>>>> future we can provide better information.
>>>>>>
>>>>>> Second, Chris is going to boil our function down to a minimal stub
>>>>>> and share it, in case there is some subtle issue with the way the
>>>>>> functions are being called.
>>>>>>
>>>>>> Third, I have a question/request. Petsc is, in fact, detecting an
>>>>>> error. As far as I can tell, this is not an uncontrolled 'seg fault'.
>>>>>> It seems to me that maybe Petsc could choose to return from the
>>>>>> function when it detects this error, returning an error code, rather
>>>>>> than dumping the core and terminating the program. If Petsc simply
>>>>>> returned with an error message, this would resolve the problem for
>>>>>> us. After the Petsc call, we check for Petsc error messages. If Petsc
>>>>>> returns an error, that's fine: we use a direct solver as a backup,
>>>>>> and the simulation continues. I am not sure whether this is feasible,
>>>>>> but if Petsc could return with an error message, rather than dumping
>>>>>> the core and terminating the program, that would effectively resolve
>>>>>> the issue for us. Would this change be possible?
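Before the replies, here is a minimal sketch of the consistency check Mark describes above: verifying that the "n" passed to MatSetValues matches the per-row nonzero count promised to MatMPIAIJSetPreallocation. The helper name and the counts d_nz = 3 and o_nz = 2 are illustrative assumptions, not code from this thread.

#include <petscmat.h>

/* Hypothetical version of the missing error check: d_nz and o_nz stand in
   for the per-row counts this code would have passed earlier to
   MatMPIAIJSetPreallocation(); the values themselves are made up. */
static PetscErrorCode SetRowChecked(Mat A, PetscInt row, PetscInt n,
                                    const PetscInt cols[], const PetscScalar vals[])
{
  const PetscInt d_nz = 3, o_nz = 2; /* counts promised at preallocation time */
  PetscErrorCode ierr;

  if (n != d_nz + o_nz)
    SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_SIZ,
            "n does not match the nonzero count promised at preallocation");
  ierr = MatSetValues(A, 1, &row, n, cols, vals, INSERT_VALUES); CHKERRQ(ierr);
  return 0;
}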
>>>>> At some level, I think it is currently doing what you want. CHKERRQ()
>>>>> simply returns an error code from that function call, printing an
>>>>> error message. Suppressing the message is harder, I think,
>>>>
>>>> He does not need this.
>>>>
>>>>> but for now, if you know which function call is causing the error,
>>>>> you can just catch the (ierr != 0) yourself instead of using CHKERRQ.
>>>>
>>>> This is what I suggested earlier, but maybe I was not clear enough.
>>>>
>>>> Your code calls something like
>>>>
>>>> ierr = SNESSolve(....); CHKERRQ(ierr);
>>>>
>>>> You can replace this with:
>>>>
>>>> ierr = SNESSolve(....);
>>>> if (ierr) {
>>>
>>> How to deal with CHKERRQ(ierr); inside SNESSolve()?
>>>
>>> PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL);
>>>
>>> But the problem in this code's runs appears to be due to corrupt data;
>>> why and how it gets corrupted is not known. Continuing with an
>>> alternative solver because a solver failed for numerical or algorithmic
>>> reasons is generally fine, but continuing when there is corrupted data
>>> is always iffy, because one doesn't know how far the corruption has
>>> spread. SNESDestroy(&snes); SNESCreate(&snes); would likely clean out
>>> any potentially corrupted data, but if the corruption got into the mesh
>>> data structures it will still be there.
>>>
>>> A very sophisticated library code would, when it detects this type of
>>> corruption, sweep through all the data structures looking for any
>>> indication of corruption, to help determine the cause and/or even fix
>>> the problem. We don't have this code in place, though we could add some,
>>> because generally we rely on valgrind or -malloc_debug to detect such
>>> corruption; the problem is that valgrind and -malloc_debug don't fit
>>> well in a production environment. Handling corruption that comes up in
>>> production but not in testing is difficult.
>>>
>>> Barry
>>>
>>>> ....
>>>> }
>>>>
>>>> I suggested something earlier to do here. Maybe call KSPView. You
>>>> could even destroy the solver and start the solver from scratch and see
>>>> if that works.
>>>>
>>>> Mark
>>>>
>>>>> The drawback here is that we might not have cleaned up all the state
>>>>> needed for restarting to make sense. It should be possible to just
>>>>> kill the solve, reset the solver, and retry, although it is not clear
>>>>> to me at first glance whether MPI will be in an okay state.
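Pulling the thread's suggestions together, here is a minimal sketch of the recover-and-retry pattern under discussion: let errors inside the solve return instead of aborting, and on failure rebuild the solver so the caller can redo the timestep with a smaller dt. The wrapper name TrySolve and the retry policy are illustrative assumptions; as Barry cautions above, none of this can undo corruption that has already spread into other data structures.

#include <petscsnes.h>

/* Hypothetical wrapper: attempt a solve with errors returned rather than
   aborted; on failure, destroy and recreate the SNES. The caller must
   re-set the function/Jacobian callbacks on the fresh solver and is
   expected to discard the timestep and shrink dt when *ok is false. */
static PetscErrorCode TrySolve(MPI_Comm comm, SNES *snes, Vec x, PetscBool *ok)
{
  PetscErrorCode ierr;

  PetscPushErrorHandler(PetscIgnoreErrorHandler, NULL);
  ierr = SNESSolve(*snes, NULL, x); /* nonzero ierr instead of MPI_Abort */
  PetscPopErrorHandler();

  if (ierr) {
    ierr = SNESDestroy(snes); CHKERRQ(ierr); /* rebuild the solver from scratch */
    ierr = SNESCreate(comm, snes); CHKERRQ(ierr);
    *ok = PETSC_FALSE;
  } else {
    *ok = PETSC_TRUE;
  }
  return 0;
}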