Carsten Aulbert wrote:
> > Are you saying that now that you are monitoring you are seeing kernel
> > panics which did not appear before?
> >
>
> No, but there seem to be a switch in the kernel module that allows to trigger
> a kernel panic upon discovering uncorrectable errors.

By "switch" do you mean:

A. There is an option that may be set when that module is loaded which
   will then cause it to panic on an uncorrectable error, where normally
   it would not.

B. There has been a change in the module code between kernel versions
   that causes it to panic now on events where it formerly did not panic.

> > You can get some information through netconsole, but you know that already.
> >
>
> Yup already running, question is if a kernel panic would also be fully visible
> via netconsole - we are glad that we rarely have those ;)

I have seen one kernel panic since turning on netconsole, and it did log
across the network and showed up in /var/log/messages as it was supposed
to, with the same information presented as in the tests. Limited data,
but it would seem the answer is "at least sometimes".

> Yes, but the memory of any process might get corrupted, thus this is more to
> learn which user is currently running jobs. Which in turn enables us to notify
> these users that this particular machine running these jobs had a problem and
> the user might need to re-run her jobs to prevent "false" data entering her
> job.

If the node blows up presumably the output of all the jobs currently
running there will clearly indicate that there was a failure - so you
should not have to notify those users since they will see the problem in
their results. (Unless MPI, or PVM, or whatever is being used to spread
jobs around, ignores fatal errors, which should never be the case.) For
jobs which completed earlier on the same node, this would have been
before an uncorrectable error took place, so the results should be OK.
Or am I missing something?

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf