> -----Original Message-----
> From: beowulf-boun...@beowulf.org 
> [mailto:beowulf-boun...@beowulf.org] On Behalf Of Mark Hahn
> Sent: Sunday, March 29, 2009 10:11 PM
> To: ariel sabiguero yawelak
> Cc: Beowulf@beowulf.org
> Subject: Re: [Beowulf] Memory errors poll
> 
> > /Could those of you running ECC memory give me an updated figure on 
> > the number of errors detected/corrected per day per system? /
> 
> we replace dimms which show > 1000 corrected ECCs per day (or 
> any overflows, for which counts are inaccurate, or any 
> uncorrectable errors.)


That seems a remarkably high rate for raw memory errors. Micron quotes 
something like 100 soft errors per 1E9 device-hours (that is, a FIT, or 
"failures in time", rate of 100).

If I saw that rate, I'd assume that there's something seriously wrong with the 
part.
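
To put the two numbers on the same scale, here is a rough sketch that converts 
a daily error count into FIT. The function name is mine, and it compares a 
per-DIMM count against a per-device figure, so treat the ratio as an order of 
magnitude only.

```python
# FIT = failures per 1E9 device-hours.
HOURS_PER_FIT_WINDOW = 1e9


def errors_per_day_to_fit(errors_per_day):
    """Convert an observed daily error count into a FIT rate."""
    errors_per_hour = errors_per_day / 24.0
    return errors_per_hour * HOURS_PER_FIT_WINDOW


# 1000 corrected ECCs per day is the replacement threshold quoted above.
threshold_fit = errors_per_day_to_fit(1000)
# 100 FIT is the rough Micron figure; assumed here as a plain constant.
quoted_fit = 100.0

print(f"threshold: {threshold_fit:.2e} FIT")
print(f"ratio to quoted rate: {threshold_fit / quoted_fit:.1e}")
```

Even allowing for several devices per DIMM, the threshold sits many orders of 
magnitude above the quoted soft error rate.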

> 
> > I have an old figure of about 1 error-bit per day per system at sea 
> > level, but I would like to know if it is getting worse or better.

This is something readily available from the memory manufacturers, at the 
device level.  

Beware of random stuff you read on the web. That is, check the date of the 
data being used in the article. Technologies change pretty substantially 
over time, so observations about DRAM error rates in 1998 probably aren't 
applicable to DRAM error rates in 2008 (unless you happen to be using 10 year 
old memory!)

A recent paper by Borucki, Schindlbeck and Slayman (IEEE CFP 08 RPS-CDR 46th 
ann. Intl. Rel. Physics Symp. 2008, pp482ff) comments that for modern parts, 
high energy cosmic rays are more important than alpha particles, and reports on 
measurements made on DIMMs. They blasted modern mobos in a neutron test 
facility, and then scaled the results for New York.  It looks like about 
100-200 FIT/Gb, which corresponds with Micron's numbers, above.  They also 
looked at multibit and logic errors as well as simple memory cell errors.  As 
expected, the SEU (single event upset) rate per bit is going down as features 
get smaller, but logic error rates stay roughly the same.

OK. So you've got a box with, say, 4 GByte of RAM. That's 32 Gb, so you'd 
expect something like 5000 errors per 1E9 hours, or 5 errors per 1E6 hours: an 
error every 200,000 hours, or about 22 years (if my before-coffee math in my 
head is right).
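
The same arithmetic as a small sketch, run at both ends of the 100-200 FIT/Gb 
range. The function name and the 365.25-day year are my own choices, and it 
assumes the per-Gb rate scales linearly with memory size.

```python
HOURS_PER_YEAR = 24 * 365.25


def hours_between_errors(mem_gbytes, fit_per_gbit):
    """Mean hours between soft errors for a box with mem_gbytes of RAM."""
    gbits = mem_gbytes * 8              # 4 GByte -> 32 Gb
    fit_total = gbits * fit_per_gbit    # errors per 1E9 hours for the whole box
    return 1e9 / fit_total


for fit_per_gbit in (100, 200):
    hours = hours_between_errors(4, fit_per_gbit)
    years = hours / HOURS_PER_YEAR
    print(f"{fit_per_gbit} FIT/Gb: one error every {hours:,.0f} hours "
          f"(about {years:.1f} years)")
```

The two ends of the range bracket the 200,000 hour figure above, which comes 
from rounding the total to 5000 FIT.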


I suspect that most "memory errors" reported for PCs (whether in clusters or 
not) are manifestations of bus timing problems, perhaps over temperature, 
rather than actual bit flips in memory.  The actual measured rate of single 
event upsets is so low that it can't account for them.



> 
> we have several thousand nodes, and most of them go for 
> months without any corrected ECCs (probably all within 200M 
> of sea level).
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org To change your 
> subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
> 