John Hearns <hear...@googlemail.com> writes:

> However, you could look for correctable ECC errors,

On the systems with which I'm familiar, they either won't show up in the
IPMI SEL or will apparently be inconsistent with the kernel mcelog --
mcelog typically displays many more events.  (I don't know why this is,
though I'm overly familiar with memory errors.)
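
One further source that could be compared against the SEL and mcelog is the
kernel's EDAC counters in sysfs. The following is only a minimal sketch: it
assumes a Linux node with an EDAC driver loaded for the memory controller, so
that /sys/devices/system/edac/mc/mc*/ce_count exists. The function name
read_ce_counts is hypothetical, and nothing above says these counters agree
any better with either log.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: correctable_error_count} from EDAC sysfs.

    Assumption: an EDAC driver is loaded, so each memory controller
    appears as a directory mcN containing a ce_count file.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # No EDAC directory: driver not loaded, or not a Linux system.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            counts[mc.name] = int((mc / "ce_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counter is missing or unreadable.
            continue
    return counts


if __name__ == "__main__":
    for name, count in read_ce_counts().items():
        print(f"{name}: {count} correctable errors")
```

Run periodically (for instance from a job prologue), the counts could be
diffed between runs to spot a node whose correctable-error rate is climbing.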

> and for disks run a smartctl test and see if a disk is showing
> symptoms which might make it fail in future.

What I typically see from smartd is alerts when one or more sectors have
already gone bad, although that tends not to be something that will
clobber the running job.  How should it be configured to do better
(without noise)?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing