No, but there seems to be a switch in the kernel module that allows triggering
a kernel panic upon discovering uncorrectable errors.

I suspect you mean /sys/module/edac_mc/panic_on_ue
(UE = uncorrected error).  I consider this very much the norm:
it would be very strange to run with ECC memory, and ECC enabled,
and not actually halt on a UE.  a UE represents a failure of the memory
system: not just a transient event, but something which must be
physically fixed. even for HA situations, I'd be pretty skeptical
about using a memory channel which had any UEs on it.

CEs (corrected errors), OTOH, are very different. they're almost just a
heartbeat of your ECC subsystem. yes, a CE indicates some event that
needed correcting, but at a modest rate, CEs are acceptable. there are
failure modes, though, where a growing CE count eventually ends in a UE;
tracking the CE rate is important for that reason. (other UE modes
don't have this warning sign...)
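
as a rough illustration of tracking the CE rate, here is a minimal Python
sketch. it assumes the EDAC driver exposes per-controller counters as
ce_count and ue_count files under /sys/devices/system/edac/mc/mc*/; that
location is an assumption on my part and may differ by kernel version. the
function names read_counts and ce_rate_per_hour are made up for this example.

```python
import glob
import os

# Assumed location of the EDAC memory-controller counters; adjust for your kernel.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_counts(root=EDAC_MC_ROOT):
    """Return {controller: (ce_count, ue_count)} for every mc* directory under root."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[os.path.basename(mc_dir)] = (ce, ue)
    return counts


def ce_rate_per_hour(before, after, elapsed_seconds):
    """Corrected errors per hour for each controller present in both snapshots."""
    if elapsed_seconds <= 0:
        return {}
    rates = {}
    for mc, (ce_after, _ue_after) in after.items():
        if mc in before:
            ce_before = before[mc][0]
            rates[mc] = (ce_after - ce_before) * 3600.0 / elapsed_seconds
    return rates
```

to use it, take one read_counts() snapshot, take another some time later
(from cron, say), and pass both plus the elapsed seconds to
ce_rate_per_hour(). what rate counts as "modest" is site policy; the sketch
only reports the numbers.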

you can set CEs to log through kernel->syslog via edac tunables in /sys.

Yes, but the memory of any process might get corrupted, thus this is more to

if UE is set to panic, nothing will get silently corrupted: the node halts instead of carrying on with bad data (that's really the point, eh?)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing