we replace dimms which show > 1000 corrected ECC errors per day,
or any counter overflows (which make the counts inaccurate), or any
uncorrectable errors.
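
For illustration only, the replacement rule above can be written as a small
predicate. This is a sketch, not our actual tooling: the DimmReport type, its
field names, and should_replace() are hypothetical names made up for the
example; only the three conditions come from the policy as stated.

```python
from dataclasses import dataclass

# Threshold taken from the policy above: more than 1000 corrected errors/day.
CE_PER_DAY_LIMIT = 1000


@dataclass
class DimmReport:
    """Hypothetical per-DIMM summary; the field names are assumptions."""
    corrected_per_day: int      # corrected ECC errors seen in one day
    overflowed: bool = False    # counter overflowed, so the count is unreliable
    uncorrectable: int = 0      # uncorrectable errors seen


def should_replace(report: DimmReport) -> bool:
    """Return True if any one of the three replacement conditions holds."""
    return (
        report.corrected_per_day > CE_PER_DAY_LIMIT
        or report.overflowed
        or report.uncorrectable > 0
    )
```

Note that an overflow triggers replacement regardless of the recorded count,
since the count can no longer be trusted once the counter has wrapped.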

These systems are a couple of generations old, right?

wait a minute - I think I gave the wrong impression.  we have about
13 TB of this gen hardware (yes, from 3 years ago).  our observed rate
is that at any given moment, a fraction of 1% of the nodes show any
corrected ECC errors at all.  our vendor is happy to replace dimms that
have a nontrivial rate, and not many nodes have needed this.

one interesting thing is that over a 3-year period, about 1% of nodes
seem to have developed higher corrected-error rates that disappeared
when the dimms were reseated.  I wonder whether this was the result of
thermal cycling...

I think I have Linux set up to record single-bit errors, and the rate

using EDAC?  I toyed with mcelog before that, but never really got much
traction until EDAC arrived with an updated kernel.
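
In case it helps, here is a rough sketch of pulling the per-controller
counters that the EDAC driver exposes in sysfs. It assumes the usual layout
of /sys/devices/system/edac/mc/mcN/ with ce_count and ue_count files; the
function name read_edac_counts() and the returned dict layout are made up for
the example, and a missing or unreadable counter is reported as None rather
than guessed.

```python
from pathlib import Path

# Usual location of the EDAC memory-controller directories on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {"mc0": {"ce_count": int or None, "ue_count": int or None}, ...}.

    ce_count is the corrected-error total and ue_count the uncorrectable
    total for that memory controller. An empty dict means no mcN
    directories were found (for example, no EDAC driver loaded).
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

These are running totals since the driver was loaded, so a per-day rate needs
two samples taken a known interval apart.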
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
