Quoting Mark Hahn <[EMAIL PROTECTED]>, on Mon 19 May 2008 08:47:46 PM PDT:

It is currently set to
Basic, which scrubs every 5.24 ms.

You'll have to look in the manual to find out what that means -- it's
probably "do a small amount of scrubbing every 5.24 ms". And you have

I expect it's the interval between cacheline-sized (64B) scrubs. as
such, I think it's much too low (4G ram in 98 hours!)


Too low based on what assumed upset rate?

If the rate is, say, 1E-13 upsets/bit/day, and you've got 1 Gbyte (roughly 1E10 bits), you're looking at 1E-3 upsets/day. Since the ECC will correct a single-bit error, what you're really fighting with the scrubbing is the probability of a *double* error in the same word. That depends on the error statistics, i.e. whether you get multiple bit errors in the same word (unlikely with most memory layout schemes, which spread words across the geometry, but you never know...).
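For anyone who wants to redo that arithmetic with their own numbers, here is a minimal Python sketch. The function name is made up for illustration, the rate is just the example figure above, and it assumes upsets are independent and equally likely for every bit.

```python
# Sketch of the expected-upset arithmetic; the rate is an illustrative assumption.

def expected_upsets_per_day(rate_per_bit_per_day, mem_bytes):
    """Expected upsets per day, assuming independent, equally likely bit upsets."""
    return rate_per_bit_per_day * mem_bytes * 8

GIB = 2**30  # 1 Gbyte taken as 2**30 bytes, about 8.6E9 bits (rounded to 1E10 above)

per_day = expected_upsets_per_day(1e-13, 1 * GIB)
days_between = 1.0 / per_day

print(f"expected upsets/day: {per_day:.2e}")
print(f"mean days between upsets: {days_between:.0f}")
```

Without the rounding to 1E10 bits, this comes out a little under 1E-3 upsets/day, i.e. on the order of one corrected upset every three years per Gbyte.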

And if you DO get a double error, the ECC will detect it, and you can halt or take corrective measures (e.g. throw away that work package's output and restart from a checkpoint).


Even if the rate is much higher, say 1E-12 upsets/bit/hour (about 240 times higher than the 1E-13 upsets/bit/day I used above), and say you've got 4 Gbyte of RAM, you're now looking at roughly a single (fully corrected) upset per day. The probability of an undetected error is still quite low (it requires at least 3 errors in the same word), and the probability of a double-bit error causing an abort (within the 100 or so hours you calculated for the scrub) is probably low enough that it wouldn't materially affect your computation rate. And this assumes that your OS doesn't autoscrub on a detected Single Bit Error, perhaps because the hardware doesn't support it.
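A rough Python sketch of the numbers in this scenario follows. The 72-bit ECC word (64 data + 8 check bits), the 64-byte scrub step, and the independence of upsets are all assumptions made for the estimate, not anything read out of a particular chipset manual.

```python
# Sketch only: assumes independent upsets and 72-bit ECC words (64 data + 8 check).
GIB = 2**30

rate_per_bit_hour = 1e-12   # assumed upset rate from the text
mem_bytes = 4 * GIB         # 4 Gbyte of RAM
scrub_chunk_bytes = 64      # one cacheline per scrub step (assumed)
scrub_interval_s = 5.24e-3  # "Basic" scrub interval quoted in the thread
word_bits = 72              # assumed ECC word size

# Expected (single, correctable) upsets per day over the whole memory.
upsets_per_day = rate_per_bit_hour * 24 * mem_bytes * 8

# Time for the scrubber to walk all of memory once.
scrub_pass_hours = (mem_bytes / scrub_chunk_bytes) * scrub_interval_s / 3600

# Expected upsets in one ECC word during one scrub pass.
lam = word_bits * rate_per_bit_hour * scrub_pass_hours

# For a Poisson count with tiny lam, P(at least 2 upsets) is about lam**2 / 2.
# The exact form 1 - exp(-lam) * (1 + lam) would lose all precision in floats here.
p_double_per_word = lam**2 / 2

n_words = mem_bytes // 8  # number of 64-bit data words
expected_double_words = n_words * p_double_per_word

print(f"upsets/day: {upsets_per_day:.2f}")
print(f"scrub pass: {scrub_pass_hours:.1f} hours")
print(f"expected double-error words per scrub pass: {expected_double_words:.1e}")
```

Under these assumptions the scrub pass takes a bit under 98 hours, and the expected number of words that collect two upsets within one pass is on the order of 1E-8, which is why the abort rate shouldn't matter much.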


OTOH, if the ECC is protecting you from a lousy mobo design with timing glitches and crosstalk between traces manifesting as errors...



Jim

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
