we replace dimms which show > 1000 corrected ECCs per day, or any
overflows (for which the counts are inaccurate), or any uncorrectable
errors.
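As a rough illustration only, the replacement rule above could be written as a small predicate. The names here (DimmSample, should_replace, CE_PER_DAY_LIMIT) are made up for this sketch and are not from any monitoring tool; the only number taken from the post is the 1000-per-day threshold.

```python
from dataclasses import dataclass

# Threshold quoted in the post: more than 1000 corrected errors per day.
CE_PER_DAY_LIMIT = 1000


@dataclass
class DimmSample:
    """Hypothetical per-DIMM summary; field names are assumptions."""
    corrected_per_day: float  # corrected (single-bit) ECC events per day
    overflowed: bool          # counter overflow seen, so the count is unreliable
    uncorrectable: int        # number of uncorrectable errors observed


def should_replace(sample: DimmSample) -> bool:
    """Return True if the DIMM meets any of the three replacement criteria."""
    return (
        sample.uncorrectable > 0
        or sample.overflowed
        or sample.corrected_per_day > CE_PER_DAY_LIMIT
    )
```

How the overflow flag is obtained depends on the reporting tool, so it is left as a plain input here.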
These systems are a couple of generations old, right?
waaait a minute - I think I gave the wrong impression. we have about
13 TB of this generation of hardware (yes, from 3 years ago). our observed
rate is that at any given moment, a fraction of 1% of the nodes have any
ECCs at all. our vendor is happy to replace dimms that have a nontrivial
rate, and there aren't a lot of nodes that have had this done.
one interesting thing is that over a 3-year period, it seems like about 1%
of nodes developed higher ECC rates that disappeared when the dimms were
reseated. I wonder whether this was the result of thermal cycling...
I think I have Linux set up to record single-bit errors, and their rate,
using edac. I toyed with mcelog before that, but never really got much
traction until edac came with an updated kernel.
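For anyone wanting to do something similar, here is a minimal sketch of polling the EDAC counters from sysfs and turning two readings into a per-day rate. It assumes the usual /sys/devices/system/edac/mc/mcN/ce_count and ue_count files are present (they only exist when an EDAC driver is loaded for the memory controller); the function names and the counter-reset handling are my own assumptions, not part of edac itself.

```python
from pathlib import Path

# Usual EDAC sysfs location; pass a different root to test without hardware.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected_count, uncorrected_count)}."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Missing or unreadable counter file: skip this controller.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def rate_per_day(prev_count, curr_count, elapsed_seconds):
    """Convert two counter readings into events per day.

    Assumption: if the counter went backwards (reboot or driver reload),
    the current value is treated as the number of new events.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_count - prev_count if curr_count >= prev_count else curr_count
    return delta * 86400 / elapsed_seconds
```

Calling read_edac_counts() from cron and keeping the previous reading around is enough to compare against a per-day threshold.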