At 01:56 PM 11/16/2007, Mark Hahn wrote:
I just asked the local NT goon, "do you use ECC for the servers?" and
he answered, "you have to". What he considers a server-class mobo
requires ECC.

whether you need ECC depends on many things. first, how much memory your
machine has - my experience is that most generic servers (web, file,
mail, etc) don't have much - maybe a few GB.  the chance of needing ECC
also depends on how "hard" you use the ram (again, mundane servers are
pretty lightly utilized.) as well as factors like altitude, ram quality,
and the ever popular "how important is your data".

for clusters, I would say that ECC is basically a necessity, unless all
the jobs can be run in a "checking" mode (ie, perform a search or
optimization, then verify the results in case the hit was due to a bit flip.)

that said, ECC events are not all that common.  I have a 768-node cluster
here, each node dual-socket opteron with 8GB PC3200 ddr. I just checked
all nodes with mcelog, and 35 have reported corrected events over roughly
the last 20 days. one may have hit an uncorrectable event (but in our
clusters, corrected ECC rate is not a good predictor for uncorrectable
ones...)


So the detected upset rate is:

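35/(768*20) = 0.0023 detected errors per day per computer, or 3.3E-14 errors/bit/day (taking 8GB as about 6.9E10 bits per node, and counting each of the 35 reporting nodes as a single event, so this is a lower bound)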
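
For anyone who wants to redo the arithmetic with their own cluster's numbers, here is a rough Python sketch. The inputs are the figures quoted above; the function and variable names are mine, 8GB is assumed to mean 8 * 2^30 bytes, and each reporting node is assumed to contribute exactly one event (mcelog may well have logged more than one per node, so treat the result as a lower bound).

```python
# Figures quoted in the message above.
NODES = 768                  # nodes in the cluster
DAYS = 20                    # approximate observation window
REPORTING_NODES = 35         # nodes that logged corrected ECC events
GIB_PER_NODE = 8             # assumption: "8GB" means 8 * 2**30 bytes

BITS_PER_NODE = GIB_PER_NODE * 2**30 * 8


def detected_rates(events, nodes, days, bits_per_node):
    """Return (events per node per day, events per bit per day)."""
    per_node_day = events / (nodes * days)
    per_bit_day = per_node_day / bits_per_node
    return per_node_day, per_bit_day


# Assumption: one event per reporting node.
per_node_day, per_bit_day = detected_rates(
    REPORTING_NODES, NODES, DAYS, BITS_PER_NODE
)

print(f"{per_node_day:.4f} detected errors/node/day")   # 0.0023
print(f"{per_bit_day:.1e} detected errors/bit/day")     # 3.3e-14
```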

Wikipedia claims 1 error/month/GB (3E-11 errors/bit/day if "GB" is read as gigabits; closer to 4E-12 if it means gigabytes), but its references are all pretty ancient (a JPL paper from 2001 is probably reporting on devices that would have been used in consumer electronics in the early 90s). They may also have been talking about "upset rates", whereas what you observe is a "detected bit error rate": you don't see every upset that has occurred, because you don't read all of memory all the time. Your accesses may be concentrated in, say, 1GB of your overall 8GB DRAM space.

http://parts.jpl.nasa.gov/docs/CassDRAM-00.pdf discusses some possible reasons why multibit error rates and single bit error rates don't scale like you'd expect (a heavy ion can zap multiple bits at one time, so the bit errors are not uncorrelated).

Spacecraft systems often implement a scrubbing algorithm that systematically reads and checks each location in turn, as opposed to waiting for the processor to happen to fetch that location. That way you have a chance to scrub an error in a word before it takes a second hit. On Cassini, the scrubbing in the 2.5 Gbit solid state recorders is such that every word gets scrubbed about every 9 minutes. They get about 200-300 single bit errors/day. But this is, truly, ancient technology...
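
To make the scrubbing idea concrete, here is a toy Python sketch. It is not Cassini's scheme (the word size and code used there aren't given above); it uses a Hamming(7,4) single-error-correcting code purely as a small stand-in, and every name in it is made up for illustration. The scrubber walks every word in turn and repairs single-bit errors in place; the last few lines show why you want to scrub before a word takes a second hit.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-indexed bit positions holding data
PARITY_POSITIONS = (1, 2, 4)     # 1-indexed bit positions holding parity


def syndrome(codeword):
    """XOR of the 1-indexed positions of all set bits (0 if valid)."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            codeword |= 1 << (pos - 1)
    s = syndrome(codeword)
    for pos in PARITY_POSITIONS:
        if s & pos:                      # set parity so the syndrome becomes 0
            codeword |= 1 << (pos - 1)
    return codeword


def decode(codeword):
    """Extract the 4 data bits (no correction is attempted here)."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (codeword >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble


def scrub(memory):
    """Check every word in turn; fix single-bit errors in place.

    Returns the number of words that were rewritten.
    """
    rewritten = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:                            # nonzero syndrome names the bad bit
            memory[addr] = word ^ (1 << (s - 1))
            rewritten += 1
    return rewritten


memory = [encode(n) for n in range(16)]

memory[3] ^= 1 << 4                      # one upset in word 3
scrub(memory)
print(decode(memory[3]) == 3)            # single hit is repaired

memory[5] ^= 0b11                        # two upsets in word 5 before a scrub
scrub(memory)
print(decode(memory[5]) == 5)            # double hit is mis-corrected
```

With a single-error-correcting code like this one, two hits in the same word between scrub passes get "corrected" to the wrong value, which is the case a short scrub interval is meant to make rare.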


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
