Everyone,

    Previously we checked the bounds for the debug version of the code but 
not the optimized version. Based on Mark's experience, I felt that the tiny 
performance hit of checking was worth it all the time, and our intention is 
now to always check these bounds.
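
    Schematically (this is illustrative, not the exact source), the check is 
of the form:

      if (row < 0 || row >= mat->rmap->N) {
        SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE, "Row index out of range");
      }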

   Barry

> On Nov 3, 2020, at 7:30 AM, Matthew Knepley <knep...@gmail.com> wrote:
> 
> On Tue, Nov 3, 2020 at 8:23 AM Mark McClure <m...@resfrac.com> wrote:
> Hi, all.
> 
> I am emailing to close the loop on this. 
> 
> Two things combined to cause our issue.
> 
> 1. At some point, several years ago, I had set 
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL), and then forgotten about 
> it. This caused the program to terminate when an error was encountered.
> 
> 2. I believe our code had a very rare, nonreproducible bug (occurring once 
> per thousands of hours of runtime) where it passed column and/or row indices 
> to Petsc that were greater than the size of the matrix.
> 
> We changed the error handler, and also put in a special error check that 
> validates the column/row assignments and aborts a timestep (and then reruns 
> it with a smaller dt) if they are out of range. Having done that, we've run 
> for over a month and have not seen the problem reproduce.
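> 
> For reference, the guard we added is roughly like this (a simplified sketch; 
> matrixSize, abortTimestep, and the halve-dt logic are our own names and 
> code, not Petsc's):
> 
>   PetscBool abortTimestep = PETSC_FALSE;
>   for (PetscInt i = 0; i < n; i++) {
>     if (cols[i] < 0 || cols[i] >= matrixSize) {  /* index out of range */
>       abortTimestep = PETSC_TRUE;                /* discard this timestep */
>       break;
>     }
>   }
>   /* if abortTimestep: destroy the Petsc objects, halve dt, redo the step */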
> 
> Two ideas that might be helpful to avoid this problem in the future:
> 
> 1. When Petsc aborts with an error because 
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL) is set, it might be 
> helpful to add to the error message that is printed to cout: "Petsc is 
> aborting the program because the Petsc Error Handler has been set to abort on 
> errors. This can be changed by modifying the option passed into the error 
> handler." Otherwise, it might be unclear to a user why Petsc is aborting (if 
> abort on error is set, but they don't realize it). We had written error 
> checks to catch errors passed out of Petsc and handle them gracefully, but we 
> had overlooked the error handler setting, and so were confused why this error 
> was causing the entire program to terminate. A bit more explanation in the 
> abort message could help avoid that kind of user confusion. I had thought it 
> was a hard crash out of Petsc, since I was confused why the function wasn't 
> returning.
> 
> Hi Mark,
> 
> That is a good suggestion. I am doing it.
>  
> 2. When the "greater than largest key allowed" error is encountered, it might 
> be helpful to add to the warning message printed to users to additionally 
> say: "This error may occur if the column and row values passed into Petsc are 
> outside the range from 0 to the size of the matrix." Not positive, but I am 
> about 90% sure that is why we were hitting the error.
> 
> Can you help me understand this? It should be impossible to pass row or 
> column indices that are greater than the matrix size through MatSetValues:
> 
>   
> https://gitlab.com/petsc/petsc/-/blob/master/src/mat/impls/aij/mpi/mpiaij.c#L582
> 
> What are you calling to get them in?
> 
>   Thanks,
> 
>     Matt
>  
> Thank you again for your help. I really appreciate your rapid responses!
> 
> Mark
> 
> On Sat, Sep 26, 2020 at 10:53 PM Mark McClure <m...@resfrac.com> wrote:
> Ok, I think we've made some progress.
> 
> We were already calling the function like this: ierr = PetscCall(); if (ierr 
> != 0) {do something to handle the error}. We are actually doing that on 
> every single call made to Petsc, just to be careful. This is what was 
> confusing to me: why was the program terminating from within Petsc and not 
> returning out with an error? We'd written our code so that if Petsc did 
> return with an error, we'd discard the full timestep, destroy all the Petsc 
> data structures, and redo everything with a smaller dt. So if Petsc did hit 
> this error very rarely, we might be able to recover gracefully.
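> 
> In outline, each call is wrapped like this (a schematic, not our exact code; 
> DestroyAllPetscStructures and redoTimestep are our own names, not Petsc's):
> 
>   ierr = KSPSolve(ksp, b, x);
>   if (ierr != 0) {
>     /* discard the timestep: destroy every Petsc object we created, */
>     /* then redo the whole step with a smaller dt */
>     DestroyAllPetscStructures();
>     dt = dt / 2.0;
>     redoTimestep = true;
>   }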
> 
> It does not appear to be seg faulting. So it seemed that the program was 
> being terminated intentionally from within Petsc, which was puzzling, and why 
> I was asking about that in my previous email.
> 
> So - Chris made a great find. It turns out that right after PetscInitialize 
> in our main.cpp, we had the line:
> 
> PetscPushErrorHandler(PetscAbortErrorHandler, NULL);
> 
> That line was telling Petsc to call MPI_Abort if there was an error. I 
> probably put it into the code years ago and forgot it was there. So, as Barry 
> said, if we change the PetscErrorHandler option to ignore, then at least we 
> can avoid the program aborting on the error, and hopefully we will be able to 
> recover with our existing code logic.
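> 
> For anyone else who hits this, the change is just swapping the handler (a 
> sketch; PetscPopErrorHandler and PetscIgnoreErrorHandler are existing Petsc 
> routines):
> 
>   PetscPopErrorHandler();  /* remove the old PetscAbortErrorHandler */
>   /* return error codes to the caller instead of calling MPI_Abort */
>   PetscPushErrorHandler(PetscIgnoreErrorHandler, NULL);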
> 
> Also, we may have found a clue on the root cause of the error. I had thought 
> that we were checking all of our inputs to Petsc for issues such as 
> out-of-range index values. But I went back and saw that, due to a versioning 
> mistake, one particular error check on our inputs was being removed from 
> production builds by a preprocessor definition. That means it wouldn't be 
> caught in our production builds, which means it is possible that bad inputs 
> were passed into Petsc. I don't know for sure - but it is plausible. The 
> missing error check did the following: it checked whether the fourth entry to 
> MatSetValues ("n", the number of nonzero values in the row; see 
> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetValues.html)
> was equal to the sum of the number of diagonal and off-diagonal values that 
> we had specified in our previous call to MatMPIAIJSetPreallocation.
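> 
> Concretely, the lost check was guarded something like this (illustrative 
> names; the macro and helper came from our build system and code, not Petsc):
> 
>   #ifdef ENABLE_INPUT_CHECKS  /* not defined in production builds */
>     if (n != nDiagonal + nOffDiagonal) {
>       /* "n" passed to MatSetValues doesn't match what we promised */
>       /* MatMPIAIJSetPreallocation for this row */
>       HandleBadInput();
>     }
>   #endif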
> 
> So that is at least a theory for what was happening. The theory would be: 
> very rarely, due to a bug in our code, we called MatSetValues with "n" set 
> to a value not equal to the number of nonzero values promised in the call to 
> MatMPIAIJSetPreallocation. Maybe this led to the Petsc "key 7556 is greater 
> than largest key allowed 5693" error message, and then our setting of 
> 'PetscAbortErrorHandler' caused the program to abort.
> 
> Mark
> 
> On Sat, Sep 26, 2020 at 9:51 PM Barry Smith <bsm...@petsc.dev> wrote:
> 
> 
>> On Sep 26, 2020, at 5:58 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>> 
>> 
>> 
>> On Sat, Sep 26, 2020 at 5:44 PM Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> 
>> On Sat, Sep 26, 2020 at 1:07 PM Matthew Knepley <knep...@gmail.com> wrote:
>> On Sat, Sep 26, 2020 at 11:17 AM Mark McClure <m...@resfrac.com> wrote:
>> Thank you all for the explanations. 
>> 
>> Following Matt's suggestion, we'll use -g (and not use --with-debugging=0) 
>> for all future compiles for all users, so in the future we can provide 
>> better information.
>> 
>> Second, Chris is going to boil our function down to a minimum stub and share 
>> it in case there is some subtle issue with the way the functions are being 
>> called.
>> 
>> Third, I have a question/request - Petsc is, in fact, detecting an error. As 
>> far as I can tell, this is not an uncontrolled 'seg fault'. It seems to me 
>> that maybe Petsc could choose to return out of the function when it detects 
>> this error, returning an error code, rather than dumping core and 
>> terminating the program. After each Petsc call, we check for Petsc errors. 
>> If Petsc returns an error - that's fine - we use a direct solver as a 
>> backup, and the simulation continues. So - I am not sure whether this is 
>> feasible - but if Petsc could return out with an error message, rather than 
>> dumping core and terminating the program, that would effectively resolve 
>> the issue for us. Would this change be possible?
>> 
>> At some level, I think it is currently doing what you want. CHKERRQ() simply 
>> returns an error code from that function call, after printing an error 
>> message. Suppressing the message is harder, I think,
>> 
>> He does not need this.
>>  
>> but for now, if you know what function call is causing the error, you can 
>> just catch the (ierr != 0) yourself instead of using CHKERRQ.
>> 
>> This is what I suggested earlier but maybe I was not clear enough.
>> 
>> Your code calls something like 
>> 
>> ierr = SNESSolve(....); CHKERRQ(ierr);
>> 
>> You can replace this with:
>> 
>>  ierr = SNESSolve(....); 
>>  if (ierr) {
>> How do you deal with the CHKERRQ(ierr) calls inside SNESSolve() itself?
> 
> 
>    PetscPushErrorHandler(PetscIgnoreErrorHandler,NULL);
> 
>    But the problem in this code's runs appears to be due to corrupt data; why 
> and how it gets corrupted is not known. Continuing with an alternative solver 
> because a solver failed for numerical or algorithmic reasons is generally 
> fine, but continuing when there is corrupted data is always iffy because one 
> doesn't know how far the corruption has spread. SNESDestroy(&snes); 
> SNESCreate(comm,&snes); would likely clean out any potentially corrupted 
> data, but if the corruption got into the mesh data structures it will still 
> be there.
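> 
>    In code, that recovery would be a sketch like:
> 
>      ierr = SNESSolve(snes, NULL, x);
>      if (ierr) {                      /* the solve hit an error */
>        ierr = SNESDestroy(&snes);     /* discard possibly corrupted solver */
>        ierr = SNESCreate(PETSC_COMM_WORLD, &snes);
>        /* re-set options and operators, then retry with a smaller dt */
>      }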
> 
>    A very sophisticated library code would, when it detects this type of 
> corruption, sweep through all the data structures looking for any indications 
> of corruption, to help determine the cause and/or even fix the problem. We 
> don't have this code in place, though we could add some, because generally we 
> rely on valgrind or -malloc_debug to detect such corruption. The problem is 
> that valgrind and -malloc_debug don't fit well in a production environment. 
> Handling corruption that comes up in production but not in testing is 
> difficult.
> 
>  Barry
> 
>>     .... 
>>  }
>> 
>> I suggested something earlier to do here. Maybe call KSPView. You could even 
>> destroy the solver and start the solver from scratch and see if that works.
>> 
>> Mark
>>  
>> The drawback here is that we might not have cleaned up
>> all the state so that restarting makes sense. It should be possible to just 
>> kill the solve, reset the solver, and retry, although it is not clear to me 
>> at first glance if MPI will be in an okay state.
>> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
