Henning Fehrmann <henning.fehrm...@aei.mpg.de> wrote:

> we started monitoring the rate of correctable errors appearing in the RAM.
> We also observed few uncorrectable errors. The corresponding kernel
> module 'edac_core' can cause a Kernel Panic when such an event occurs,
> which makes sense to avoid corrupted results.

Are you saying that now that you are monitoring you are seeing kernel
panics which did not appear before?

> Is there a way to get some useful information before the kernel panics?

You can get some information through netconsole, but you know that already.

> In particular are we looking for the process list to find out which
> user was running what before the UE errors occurred.

Well, you could log process start/stops and flush them to disk or syslog
them, so that at least when the system crashes it would be possible to
derive a list of everything that was still running.

Doubt this will help much though, since the most likely culprit is a bad
stick of memory, in which case the netconsole or IPMI or MCE messages may
be enough to figure out which stick is the problem. That is, whichever
process triggered it is probably an innocent bystander.

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
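
As a rough illustration of the "log what is running and flush it to disk" idea from the reply above, here is a minimal Python sketch. It is an assumption-laden stand-in, not something from the thread: instead of logging true process start/stop events it polls /proc for a periodic snapshot, so short-lived processes can be missed. The function names (snapshot_processes, append_snapshot) and the log format are hypothetical. The fsync call is what gives the last snapshot a chance of surviving a panic.

```python
import os
import pwd
import time


def snapshot_processes(proc_root="/proc"):
    """Return a sorted list of (pid, user, command) seen under proc_root."""
    entries = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # only numeric directories are processes
        pid_dir = os.path.join(proc_root, name)
        try:
            # /proc/<pid> is normally owned by the process's effective uid
            uid = os.stat(pid_dir).st_uid
            with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while we were scanning
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)  # no passwd entry; fall back to the numeric uid
        # cmdline is NUL-separated; kernel threads have an empty cmdline
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        entries.append((int(name), user, cmd or "[kernel thread]"))
    return sorted(entries)


def append_snapshot(log_path, entries):
    """Append one timestamped snapshot and force it to stable storage."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(log_path, "a") as log:
        for pid, user, cmd in entries:
            log.write(f"{stamp} pid={pid} user={user} cmd={cmd}\n")
        log.flush()
        os.fsync(log.fileno())  # push the data to disk before a possible panic
```

Calling append_snapshot(path, snapshot_processes()) from cron or a small loop every minute or so would leave a recent user/command list on disk after a crash. On a diskless node the same lines could be sent to a remote syslog instead.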