Re: [Beowulf] Not quite Walmart, or, living without ECC?

Scott Atchley Mon, 26 Nov 2007 13:19:28 -0800

On Nov 26, 2007, at 3:27 PM, David Mathog wrote:

I ran a little test over the Thanksgiving holiday to see how common
random errors in non-ECC memory are. I used the memtest86+ bit fade test
mode, which writes all 1s, waits 90 minutes, checks the result, then
does the same thing for all 0s. Anyway, this was the best test I could
find for detecting the occasional gamma ray type data loss event.  The
result: no errors logged in 5 solid days of testing.  So this class of
error (the type ECC would detect and probably fix) apparently occurs
on these machines at a rate of less than 1 per 840 Gigabyte-hours.
Possibly the upper limit is half that if data can only be lost
on 1 -> 0 transition, or vice versa.  This assumes the bit fade test
works, which cannot be independently verified from these results.


On the web there are references to an IBM study which found 1 bit
error per 256 MB per month, which would have been (0.25 * 30 * 24) =
1 per 180 Gigabyte-hours.  If IBM's numbers held for my hardware
I should have seen 4 or 5 errors in total.  My machines are in a basement
in a concrete building; perhaps that provided some shielding relative to
what IBM used for their test conditions.
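
For anyone who wants to redo the arithmetic, here is a rough sketch in
Python. It takes the 840 Gigabyte-hours of exposure and the IBM figure
quoted above. Treating the errors as independent events (a Poisson
process) is an assumption added here, not something either study
states, and the function names are made up for illustration.

```python
import math

HOURS_PER_MONTH = 30 * 24  # 30-day month, as in the estimate above


def gb_hours_per_error(gb_per_error_per_month: float) -> float:
    """Convert '1 error per X GB per month' into GB-hours per error."""
    return gb_per_error_per_month * HOURS_PER_MONTH


def expected_errors(exposure_gb_hours: float, gb_hours_per_err: float) -> float:
    """Expected error count for a given exposure at a given rate."""
    return exposure_gb_hours / gb_hours_per_err


def prob_zero_errors(expected: float) -> float:
    """P(no errors), ASSUMING errors arrive as a Poisson process."""
    return math.exp(-expected)


ibm_rate = gb_hours_per_error(0.25)       # 256 MB = 0.25 GB
exposure = 840.0                          # GB-hours, from the test above
lam = expected_errors(exposure, ibm_rate)
p_none = prob_zero_errors(lam)

print(f"IBM rate: 1 error per {ibm_rate:.0f} GB-hours")
print(f"Expected errors in {exposure:.0f} GB-hours: {lam:.2f}")
print(f"Chance of seeing none (Poisson assumption): {p_none:.4f}")
```

Under those assumptions the expected count is about 4.7 errors, and the
chance of logging none at all is a bit under 1%. So if the IBM rate did
apply here, a clean run would be unusual but not impossible.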

The memory was Corsair Twinx1024-3200C2.  When first installed all
of this memory had run for 24 hours with no errors in normal
memtest86+ testing.

Regards,

David Mathog


Or maybe you got lucky. Five days may not be long enough.

We have had customers report events that included parity errors on hundreds of nodes simultaneously on large clusters. Higher altitude makes things worse. Being in a DOE lab near lots of interesting materials does not help either. :-)


Scott
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
