Re: [Beowulf] Not quite Walmart, or, living without ECC?

Jim Lux Fri, 16 Nov 2007 15:40:26 -0800

At 01:56 PM 11/16/2007, Mark Hahn wrote:

I just asked the local NT goon, "do you use ECC for the servers?" and
he answered, "you have to". What he considers a server-class mobo
requires ECC.

whether you need ECC depends on many things. first, how much memory
your machine has - my experience is that most generic servers (web, file,
mail, etc) don't have much - maybe a few GB.  the chance of needing ECC
also depends on how "hard" you use the ram (again, mundane servers
are pretty lightly utilized.) as well as factors like altitude, ram quality,
and the ever popular "how important is your data".

for clusters, I would say that ECC is basically a necessity, unless all
the jobs can be run in a "checking" mode (ie, perform a search or
optimization, then verify the results in case the hit was due to a bit flip.)
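
A minimal sketch of what such a "checking" mode could look like, in Python. Everything here is made up for illustration: checked_search and make_flaky_test are hypothetical names, and the "bit flip" is simulated by forcing one evaluation to return a wrong answer. The idea is only that a hit is accepted after it has been re-evaluated and the answers agree.

```python
def checked_search(candidates, is_hit, rechecks=2):
    """Return the first candidate whose hit survives `rechecks` re-evaluations."""
    for c in candidates:
        # A one-off false positive is unlikely to repeat on re-evaluation.
        if is_hit(c) and all(is_hit(c) for _ in range(rechecks)):
            return c
    return None


def make_flaky_test(n, bad_call):
    """Divisibility test for n that wrongly says True on call number `bad_call`.

    The forced wrong answer stands in for a bit flip; it is an assumption
    of this sketch, not a model of real memory errors.
    """
    calls = 0

    def is_divisor(d):
        nonlocal calls
        calls += 1
        if calls == bad_call:
            return True          # simulated corrupted result
        return n % d == 0

    return is_divisor


# Unchecked search: trusts the first hit, including the corrupted one.
unchecked_test = make_flaky_test(91, bad_call=2)
unchecked = next(d for d in range(2, 100) if unchecked_test(d))

# Checked search: the bogus hit fails its re-evaluation and is skipped.
checked = checked_search(range(2, 100), make_flaky_test(91, bad_call=2))

print("unchecked:", unchecked, "checked:", checked)
```

Note that this only guards against false hits. A real hit that is missed because of a flip is not caught by re-verifying, so it only fits jobs where a missed hit is tolerable or the search can be repeated.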

that said, ECC events are not all that common.  I have a 768-node cluster
here, each node dual-socket opteron with 8GB PC3200 ddr. I just
checked all nodes with mcelog, and 35 have reported corrected events
over roughly the last 20 days. one may have hit an uncorrectable event (but in
our clusters, corrected ECC rate is not a good predictor for uncorrectable
ones...)



So the detected upset rate is:

35/(768*20) detected errors per day per computer (0.0023), or, with 8GB
(about 6.9E10 bits) per node, 3.3E-14 errors/bit/day

Wikipedia claims 1 error/month/GB (3E-11 errors/bit/day) but their
references are all pretty ancient (a JPL paper from 2001 is probably
reporting on devices that would have been used in consumer
electronics in the early 90s). They may also have been talking about
"upset rates", and what you observe is "detected bit error rate"
(that is, you don't see all the upsets that have occurred, because
you don't read all memory, all the time... your accesses may be
concentrated in, say, 1GB of your overall 8GB DRAM space).
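
For anyone who wants to redo the arithmetic, here is a small Python sketch of the two rate estimates above. The assumptions are mine: 1 GB is taken as 2**30 bytes, a month as 30 days, and the 35 reports are counted as 35 errors. The per-month figure is converted two ways, since the 3E-11 value comes out if "GB" is read as 10**9 bits, while reading it as gigabytes gives roughly 4E-12.

```python
# Observed cluster numbers from the post.
NODES = 768
DAYS = 20
CORRECTED_EVENTS = 35            # assumption: counted as 35 single errors
BITS_PER_NODE = 8 * 2**30 * 8    # 8 GB per node, assuming 1 GB = 2**30 bytes

per_node_per_day = CORRECTED_EVENTS / (NODES * DAYS)
per_bit_per_day = per_node_per_day / BITS_PER_NODE

# "1 error/month/GB", converted to errors/bit/day (30-day month assumed).
DAYS_PER_MONTH = 30
wiki_if_gbyte = 1 / (DAYS_PER_MONTH * 2**30 * 8)   # "GB" read as gigabytes
wiki_if_gbit = 1 / (DAYS_PER_MONTH * 1e9)          # "GB" read as 10**9 bits

print(f"observed: {per_node_per_day:.4f} errors/node/day")
print(f"observed: {per_bit_per_day:.1e} errors/bit/day")
print(f"1/month/GB as gigabytes: {wiki_if_gbyte:.1e} errors/bit/day")
print(f"1/month/GB as 1e9 bits:  {wiki_if_gbit:.1e} errors/bit/day")
```

Either way the observed rate is a couple of orders of magnitude below the per-month figure, which is consistent with the point that only the memory actually being read gets checked.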

http://parts.jpl.nasa.gov/docs/CassDRAM-00.pdf discusses some
possible reasons why multibit error rates and single bit error rates
don't scale like you'd expect (a heavy ion can zap multiple bits at
one time, so the bit errors are not uncorrelated). In spacecraft
systems, often, they implement a scrubbing algorithm that
systematically reads and checks each location in turn, as opposed to
waiting for the processor to happen to fetch that location. That's
so that you have a chance to scrub an error in a word before it takes
a second hit. On Cassini, the scrubbing in the 2.5 Gbit solid state
recorders is such that every word gets scrubbed about every 9
minutes. They get about 200-300 single bit errors/day. But this is,
truly, ancient technology...
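
To make the scrubbing idea concrete, here is a toy Python sketch. It is not the Cassini implementation: I have swapped in a textbook Hamming(7,4) single-error-correcting code on 4-bit words purely for illustration, and the function names (encode, decode, syndrome, scrub) are my own. The scrubber walks every word in turn, checks it, and rewrites any word that has a single-bit error.

```python
import random

DATA_POSITIONS = (3, 5, 6, 7)    # 1-based codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based codeword positions holding parity bits


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    # Setting the parity bit at position p cancels bit p of the syndrome.
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << (p - 1)
    return word


def decode(word):
    """Extract the 4 data bits from a (corrected) codeword."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble


def scrub(memory):
    """Visit every word in turn, fix single-bit errors in place, return the count."""
    fixed = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:                                  # non-zero syndrome = flipped position
            memory[addr] = word ^ (1 << (s - 1))
            fixed += 1
    return fixed


rng = random.Random(0)
data = [rng.randrange(16) for _ in range(1024)]
memory = [encode(n) for n in data]

# Inject one upset into each of five distinct words.
for addr in rng.sample(range(len(memory)), 5):
    memory[addr] ^= 1 << rng.randrange(7)

corrected = scrub(memory)
recovered = [decode(w) for w in memory]
print("corrected:", corrected, "data intact:", recovered == data)
```

With a single-error-correcting code like this one, a second flip in the same word before the scrubber reaches it would be miscorrected rather than fixed, which is the reason for keeping the scrub interval short relative to the upset rate.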



_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
