Hi David,

Thank you for the response.

> Carsten Aulbert wrote
> > > Are you saying that now that you are monitoring you are seeing kernel
> > > panics which did not appear before?
> >
> > No, but there seem to be a switch in the kernel module that allows to trigger
> > a kernel panic upon discovering uncorrectable errors.
>
> By "switch" do you mean:
> A. There is an option that may be set when that module is loaded which
> will then cause it to panic on an uncorrectable error, where normally it
> would not.
> B. There has been a change in the module code between kernel versions
> that causes it to panic now on events where it formerly did not panic.

It is A. There is a module parameter for edac_core: edac_mc_panic_on_ue=1.
We have not tested it yet, since uncorrectable errors (UEs) occur rarely.

> > > You can get some information through netconsole, but you know that already.
> >
> > Yup already running, question is if a kernel panic would also be fully visible
> > via netconsole - we are glad that we rarely have those ;)
>
> I have seen one kernel panic since turning on netconsole, and it did log
> across the network and showed up in /var/log/messages as it was supposed
> to, with the same information presented as in the tests. Limited data,
> but it would seem the answer is "at least sometimes".

I got a hint from one of the kernel developers: calling show_state() in
panic.c right before dump_stack() should print process information via
printk, which netconsole could then collect. We are still waiting for a UE
event to try this.

> > Yes, but the memory of any process might get corrupted, thus this is more to
> > learn which user is currently running jobs. Which in turn enables us to notify
> > these users that this particular machine running these jobs had a problem and
> > the user might need to re-run her jobs to prevent "false" data entering her
> > job.
>
> If the node blows up presumably the output of all the jobs currently
> running there will clearly indicate that there was a failure - so you
> should not have to notify those users since they will see the problem in
> their results. (Unless MPI, or PVM, or whatever is being used to spread
> jobs around, ignores fatal errors, which should never be the case.) For
> jobs which completed earlier on the same node, this would have been
> before an uncorrectable error took place, so the results should be OK.

Yes, this is correct. A panic should be enough to avoid corrupted data.
However, jobs often fail for other reasons, and a process list might help us
rule out other causes of a job failure. It makes the work a bit more
convenient.

Cheers,
Henning
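
For reference, a minimal sketch of how one might check whether the edac_mc_panic_on_ue switch mentioned above is currently set on a node. It assumes the parameter is exposed at /sys/module/edac_core/parameters/edac_mc_panic_on_ue, which is where module parameters normally appear; the path may differ or be absent depending on the kernel version and whether edac_core is loaded. The function name panic_on_ue_enabled is made up for this illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the parameter (not confirmed in this thread);
# module parameters normally appear under /sys/module/<module>/parameters/.
PARAM_PATH = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")


def panic_on_ue_enabled(path: Path = PARAM_PATH) -> Optional[bool]:
    """Return True/False for the parameter value, or None if it cannot be read."""
    try:
        text = path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no read permission.
        return None
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected file content; report "unknown" rather than guess.
        return None


if __name__ == "__main__":
    state = panic_on_ue_enabled()
    if state is None:
        print("edac_mc_panic_on_ue: not available on this node")
    else:
        print("edac_mc_panic_on_ue:", "enabled" if state else "disabled")
```

Run across the cluster, something like this would only report which nodes have the switch set; it does not change the setting or trigger a panic.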