No, but there seems to be a switch in the kernel module that allows triggering
a kernel panic upon discovering uncorrectable errors.
I suspect you mean /sys/module/edac_mc/panic_on_ue
(UE = uncorrectable error). I consider this very much the norm:
it would be very strange to run with ECC memory, and ECC enabled,
and not actually halt on UE. UE represents a failure of the memory
system, not just a transient event, but something which must be
physically fixed. even for HA situations, I'd be pretty skeptical
about using a memory channel which had any UEs on it.
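
to check how a node is actually configured, something like the sketch below would do. it is only an illustration: the path is the one quoted above and may sit elsewhere depending on kernel version, and read_flag is a made-up helper name, not part of any EDAC tooling.

```python
from pathlib import Path

# Path as quoted in this thread; an assumption, adjust it for your kernel.
PANIC_ON_UE = Path("/sys/module/edac_mc/panic_on_ue")


def read_flag(path):
    """Return True/False for an on/off sysfs tunable, or None if unknown.

    None means the file is missing, unreadable, or holds an unexpected value.
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None
    # Integer tunables print 0/1; boolean module parameters print N/Y.
    if text in ("1", "Y", "y"):
        return True
    if text in ("0", "N", "n"):
        return False
    return None


if __name__ == "__main__":
    state = read_flag(PANIC_ON_UE)
    if state is None:
        print("panic_on_ue not found at", PANIC_ON_UE)
    else:
        print("panic on UE is", "enabled" if state else "disabled")
```

a None result just means the tunable was not found where the sketch looked, not that ECC reporting is off.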
CEs (corrected errors), OTOH, are very different. they're almost just
a heartbeat of your ECC subsystem. yes, a CE indicates some event
that needed correcting, but at a modest rate, CEs are acceptable.
there are failure modes, though, where enough CEs eventually cause
a UE: tracking CE rate is important for that reason. (other UE modes
don't have this warning sign...)
you can set CEs to log through kernel->syslog via edac tunables in /sys.
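
for tracking the CE rate, a rough sketch is below. it assumes the per-controller counters show up as /sys/devices/system/edac/mc/mc*/ce_count, which is where the EDAC sysfs interface normally puts them. the function names are invented for illustration, and what counts as a "modest" rate is something you'd have to pick for your own hardware.

```python
import glob
import time
from pathlib import Path

# Assumed location of the EDAC memory-controller counters.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_MC_ROOT):
    """Map each controller directory name (mc0, mc1, ...) to its ce_count."""
    counts = {}
    for path in sorted(glob.glob(f"{root}/mc*/ce_count")):
        p = Path(path)
        try:
            counts[p.parent.name] = int(p.read_text().strip())
        except (OSError, ValueError):
            continue  # skip counters that vanish or hold unexpected text
    return counts


def ce_rates(before, after, seconds):
    """CEs per hour for every controller present in both snapshots."""
    return {
        mc: (after[mc] - before[mc]) * 3600.0 / seconds
        for mc in after
        if mc in before
    }


def over_threshold(rates, max_per_hour):
    """Sorted names of controllers whose CE rate exceeds max_per_hour."""
    return sorted(mc for mc, rate in rates.items() if rate > max_per_hour)


def sample_rates(interval_s, root=EDAC_MC_ROOT):
    """Take two snapshots interval_s seconds apart and return CEs per hour."""
    first = read_ce_counts(root)
    time.sleep(interval_s)
    second = read_ce_counts(root)
    return ce_rates(first, second, interval_s)
```

calling over_threshold(sample_rates(600), max_per_hour) from cron or a monitoring agent would flag controllers worth a closer look. note that a counter reset (module reload, reboot) between snapshots shows up as a negative rate, which this sketch does not try to handle.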
Yes, but the memory of any process might get corrupted, thus this is more to
if UE is set to panic, nothing will get corrupted (that's really the point eh?)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing