Quoting Joe Landman <[EMAIL PROTECTED]>, on Mon 12 May 2008 06:03:56 PM PDT:

Perry E. Metzger wrote:
Joe Landman <[EMAIL PROTECTED]> writes:
I've been reading spec sheets, and they often don't tell you, which is
rather annoying. Thus my question.
I just randomly selected 2 motherboards from 2 different vendors and
on both spec sheets, they clearly defined which memory they took.

Oh, sheets will tell you that they *take* ECC memory, but long
experience says that the motherboards that actually properly do ECC
scrubbing are a subset -- some boards will accept the extra ECC bits
and do nothing with them! Generally speaking, the only reliable way


There's a difference between logic that does EDAC on read (correcting any single-bit errors) and logic that actually writes the corrected word back when an EDAC hit occurs. In most cases, the latter is under software control: when you get the unlikely upset, some sort of interrupt routine goes out and rewrites that general section of memory. The EDAC logic remembers the location of the "error", so the ISR knows where to read/write.
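To make the distinction concrete, here is a toy sketch in Python using a Hamming(7,4) code. It is only an illustration: real EDAC hardware typically uses wider SEC-DED codes, and every name here (encode, read_corrected, on_edac_hit, the error_log list standing in for the latched error address) is made up for the example.

```python
# Toy model of correct-on-read vs. software write-back.
# Assumption: Hamming(7,4), one codeword per "memory location".

def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    code = [0] * 8                          # index 0 unused
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]

def syndrome(word):
    """Return 0 for a clean word, else the 1-based position of the flipped bit."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def decode(word):
    """Extract the 4 data bits from a (corrected) codeword."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)

def read_corrected(memory, addr, error_log):
    """Correct on read: hand back good data, but leave the stored word as it is."""
    word = list(memory[addr])
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1          # fix the copy being returned
        error_log.append(addr)    # stand-in for the latched error address
    return decode(word)

def on_edac_hit(memory, error_log):
    """Stand-in for the ISR: rewrite each logged location with clean data."""
    while error_log:
        addr = error_log.pop()
        data = read_corrected(memory, addr, [])
        memory[addr] = encode(data)
```

After a single bit flip, read_corrected() keeps returning the right value while the stored word stays bad; only on_edac_hit() actually clears the error from memory.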

It's not clear that the rewrite actually buys you much, IFF the error rate is low. That is, if you get one upset per day, you're probably safe just correcting on the read and assuming that you won't get a second upset *on the same word* before it gets overwritten with something else (which would also rewrite the check bits). OTOH, if you have an upset rate in the sub-second range (pretty unlikely, I should think), some sort of active scrubbing might be useful.
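A back-of-the-envelope way to put numbers on that, assuming upsets arrive as a Poisson process spread uniformly over all words (the function name and the example figures below are illustrative assumptions, not measurements):

```python
import math

def p_second_hit(upsets_per_day, n_words, residency_days):
    """Probability that one particular word takes another upset while it
    sits unmodified for residency_days.

    Assumptions: upsets are Poisson with total rate upsets_per_day,
    spread uniformly over n_words ECC words.
    """
    expected_hits = upsets_per_day * residency_days / n_words
    # 1 - exp(-x), computed via expm1 so tiny probabilities keep precision
    return -math.expm1(-expected_hits)

# Example (assumed numbers): 1 upset/day over 1 GiB of 64-bit words,
# with a word left untouched for 30 days.
p = p_second_hit(upsets_per_day=1.0, n_words=2**27, residency_days=30.0)
print(f"chance of a second hit on the same word: {p:.2e}")
```

With those assumed figures the chance is on the order of 1e-7 per already-hit word, which is why correct-on-read alone looks adequate at low upset rates. Crank the rate up by several orders of magnitude and the picture changes.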

The typical scrubber is tied to some sort of interrupt line or clock, and just reads/writes each location in turn. Obviously, this burns memory bus bandwidth.
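Roughly, such a scrubber looks like the sketch below. This is a generic illustration, not any particular chipset's implementation: the Scrubber class and its parameters are hypothetical, and the majority-vote "code" is just a stand-in for whatever correction the EDAC logic really does.

```python
class Scrubber:
    """Walk memory a few words per timer tick, writing each word back corrected."""

    def __init__(self, memory, correct, words_per_tick=2):
        self.memory = memory                  # list of stored words
        self.correct = correct                # callable: stored word -> clean word
        self.words_per_tick = words_per_tick  # more words = more bus bandwidth burned
        self.cursor = 0                       # next location to scrub
        self.fixed = 0                        # count of words that needed repair

    def tick(self):
        """Called from the timer/interrupt: read, correct, write back, advance."""
        for _ in range(min(self.words_per_tick, len(self.memory))):
            stored = self.memory[self.cursor]
            good = self.correct(stored)
            if good != stored:
                self.fixed += 1
            self.memory[self.cursor] = good
            self.cursor = (self.cursor + 1) % len(self.memory)

def majority(triple):
    """Stand-in correction: bitwise majority vote over three stored copies."""
    a, b, c = triple
    v = (a & b) | (a & c) | (b & c)
    return (v, v, v)

memory = [(5, 5, 5), (9, 9, 9), (7, 3, 7), (1, 1, 1)]   # word 2 has an upset
scrubber = Scrubber(memory, majority, words_per_tick=2)
scrubber.tick()   # scrubs words 0 and 1
scrubber.tick()   # scrubs words 2 and 3, repairing word 2
```

The words_per_tick knob is the bandwidth trade-off: a full pass over memory takes len(memory) / words_per_tick ticks.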

There are also systems that have autoscrubbing, where the memory contents are expected to be constant over a long time, so they get continuously rewritten (or, at least, a checksum is done, and if it fails, the contents are rewritten from a known good copy). This is a typical scheme for SRAM-based FPGAs (Xilinx) in spaceflight applications.
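The checksum-and-rewrite variant can be sketched like this. It is a generic illustration only, not how any actual FPGA configuration scrubber is built; the autoscrub function, the frame size, and the use of CRC-32 are all assumptions for the example.

```python
import zlib

def autoscrub(live, golden, frame_size=4):
    """Compare each frame of live memory against the golden copy by checksum
    and rewrite any frame that differs. Returns the repaired frame indices.

    Assumptions: live is a bytearray the same length as golden, and
    CRC-32 stands in for whatever checksum the real system uses.
    """
    repaired = []
    for start in range(0, len(golden), frame_size):
        end = start + frame_size
        if zlib.crc32(live[start:end]) != zlib.crc32(golden[start:end]):
            live[start:end] = golden[start:end]   # restore from known good copy
            repaired.append(start // frame_size)
    return repaired

golden = bytes(range(16))          # known good image
live = bytearray(golden)           # working copy exposed to upsets
live[9] ^= 0x10                    # simulate a single-bit upset in frame 2
print(autoscrub(live, golden))     # reports which frames were rewritten
```

A real system would more likely store the golden checksums once rather than recompute them on every pass; the recomputation here just keeps the sketch short.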


Just how many bit errors do people see? Even in the old, very soft DRAM technology I worked with in the early 80s, we'd get maybe 1 legitimate SBE per week in a megabyte or so of DRAM. That's with HUGE transistor sizes and very, very soft parts. We saw more errors than that during prototyping, but those were manifestations of bus timing problems, or timing conflicts that trashed bits.

Here's a presentation on SDRAMs in spaceflight applications:
http://klabs.org/richcontent/MAPLDCon02/presentations/session_p/p7_ladbury_s.ppt

There's some data on an EDAC-protected array of twelve 256 Mbit SDRAMs, with upset rates in geosynchronous orbit of 4E-17 per day (99th percentile). That's with a raw upset rate of about 1/2 bit/day overall.

http://www.maxwell.com/microelectronics/support/presentations/ESCCON_2002.pdf


Heh... you want motherboards that work?  Different question :(

I've found to determine if the ECC stuff works is to look at the BIOS
ECC settings, but often that info seems to be missing from the
manuals.

Sadly, the bios ECC settings on a number of MB's appear to be busted in
some cases ...  well, ok, the bios setup of the ECC system appears to
be busted.  At least most will signal an MCE these


There are also nefarious memory parts that emulate the ECC bits, which, of course, does absolutely no good.

Anyway, I was asking for a reason. :)


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
