Re: [Beowulf] ECC support on motherboards?

Jim Lux Mon, 12 May 2008 19:49:12 -0700

Quoting Joe Landman <[EMAIL PROTECTED]>, on Mon 12 May 2008 06:03:56 PM PDT:

Perry E. Metzger wrote:

Joe Landman <[EMAIL PROTECTED]> writes:

I've been reading spec sheets, and they often don't tell you, which is
rather annoying. Thus my question.

I just randomly selected 2 motherboards from 2 different vendors and
on both spec sheets, they clearly defined which memory they took.


Oh, sheets will tell you that they *take* ECC memory, but long
experience says that the motherboards that actually properly do ECC
scrubbing are a subset -- some boards will accept the extra ECC bits
and do nothing with them! Generally speaking, the only reliable way

There's a difference between logic that does EDAC on read (and corrects any single bit errors), and having logic that actually writes back when an EDAC hit occurs. In most cases, the latter is under software control: that is, when you get the unlikely upset, you have some sort of interrupt routine that goes out and rewrites that general section of memory. The EDAC logic remembers what the location of the "error" was, so the ISR knows where to read/write.
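To make the read-versus-write-back distinction concrete, here is a minimal sketch using a toy Hamming(7,4) code. This is an assumption chosen for brevity; real memory controllers generally use wider SECDED codes, and the function names (`encode`, `syndrome`, `extract`, `read`) are made up for the illustration. The point is only that correcting on read hands back good data while the stored word stays upset until something rewrites it.

```python
# Toy Hamming(7,4) single-error-correcting code. Illustrative only: it is
# not the code any particular chipset uses.

def encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused, positions are 1-based
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return sum(code[p] << (p - 1) for p in range(1, 8))

def syndrome(word):
    """Return 0 for a clean word, else the 1-based position of the flipped bit."""
    s = 0
    for p in range(1, 8):
        if (word >> (p - 1)) & 1:
            s ^= p
    return s

def extract(word):
    """Pull the 4 data bits back out of a (corrected) codeword."""
    bit = lambda p: (word >> (p - 1)) & 1
    return bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3)

def read(memory, addr, write_back=False):
    """Correct a single-bit error on read; optionally rewrite the stored word."""
    word = memory[addr]
    s = syndrome(word)
    if s:
        word ^= 1 << (s - 1)            # fix the copy being returned
        if write_back:
            memory[addr] = word         # the "scrub": fix the stored copy too
    return extract(word), s

memory = [encode(0b1011)]
memory[0] ^= 1 << 4                     # simulate an upset at bit position 5

value, hit = read(memory, 0)            # data comes back correct...
still_bad = syndrome(memory[0]) != 0    # ...but the stored word is still upset

read(memory, 0, write_back=True)        # what the ISR's rewrite would do
now_clean = syndrome(memory[0]) == 0
```

In this sketch `read(..., write_back=True)` plays the role of the interrupt routine: the syndrome tells it which bit was hit, and the rewrite clears the stored error so a later second upset in the same word is again a correctable single-bit event.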

It's not clear that the rewrite actually buys you much, IFF the error rate is low. That is, if you get one upset per day, you're probably safe just correcting on the read, and assuming that you'll not get a second upset *on the same word* before it changes to something else (which would rewrite the syndrome bits). OTOH, if you have an upset rate in the sub-second range (pretty unlikely, I should think), maybe some sort of active scrubbing might be useful.
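A back-of-the-envelope version of that argument, as a sketch: assume upsets arrive as a Poisson process and land uniformly and independently across all words (both are assumptions, as are the memory size and residence time below). Then the chance that an already-upset word takes a second hit before it is rewritten is 1 - exp(-rate_per_word * time).

```python
import math

def p_second_upset(upsets_per_day, n_words, residence_days):
    """Probability that one specific word is hit again within residence_days.

    Assumes Poisson arrivals spread uniformly over n_words; residence_days
    is how long the word sits in memory before something rewrites it.
    """
    rate_per_word = upsets_per_day / n_words
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-rate_per_word * residence_days)

# Hypothetical machine: 1 GiB of 64-bit words, one upset per day overall,
# data left untouched for a full day after the first upset.
words = (1 << 30) // 8
p_slow = p_second_upset(1, words, 1.0)

# Same machine with one upset per second (86400 per day).
p_fast = p_second_upset(86400, words, 1.0)
```

With these made-up numbers `p_slow` is on the order of 1e-8 per upset word, which is the sense in which correct-on-read alone is probably good enough at low upset rates.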

The typical scrubber is tied to some sort of interrupt line or clock, and just reads/writes each location in turn. Obviously, this burns memory bus bandwidth.
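How much bandwidth is easy to estimate. A rough sketch, assuming the scrubber touches every byte once per pass and (in the default case) writes every location back whether or not it was in error; the memory size and pass period are hypothetical:

```python
def scrub_bandwidth(mem_bytes, pass_period_s, write_back_all=True):
    """Average bus traffic in bytes/second for one full scrub pass per period.

    write_back_all=True models a read plus a write of every location;
    False models a read-only pass that writes only on an error (negligible).
    """
    traffic_per_pass = mem_bytes * (2 if write_back_all else 1)
    return traffic_per_pass / pass_period_s

# Hypothetical: 1 GiB scrubbed once per hour.
hourly = scrub_bandwidth(1 << 30, 3600)   # roughly 0.6 MB/s
```

The cost scales inversely with the pass period, so scrubbing the same memory once a second instead of once an hour costs 3600 times the bandwidth.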

There are also systems that have autoscrubbing, where the memory contents are expected to be constant over a long time, so it gets continuously rewritten (or, at least, a checksum is done, and if it fails, it rewrites from a known good copy). This is a typical scheme for SRAM based FPGAs (Xilinx) in spaceflight applications.
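The checksum-and-rewrite variant can be sketched in a few lines. This is only an illustration of the idea, not the actual Xilinx configuration scrubbing mechanism: the frame size, the use of CRC32, and the `autoscrub` name are all assumptions.

```python
import zlib

def autoscrub(live, golden, frame_size):
    """Checksum each frame of live memory against the known good copy.

    live is a mutable bytearray, golden is the reference image of the same
    length. Frames whose CRC32 differs are rewritten from golden. Returns
    the indices of the frames that were repaired.
    """
    repaired = []
    for start in range(0, len(golden), frame_size):
        end = start + frame_size
        if zlib.crc32(live[start:end]) != zlib.crc32(golden[start:end]):
            live[start:end] = golden[start:end]   # rewrite from the good copy
            repaired.append(start // frame_size)
    return repaired

golden = bytes(range(64))          # stand-in for a static configuration image
live = bytearray(golden)
live[10] ^= 0x04                   # simulate a single-bit upset in frame 0

fixed = autoscrub(live, golden, frame_size=16)
```

In practice the per-frame checksums of the golden image would be computed once and stored rather than recomputed on every pass; they are recomputed here only to keep the sketch short.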

Just how many bit errors do people see? Even in old, very soft DRAM technology I worked with in the early 80s, we'd get maybe 1 legitimate SBE per week in a megabyte or so of DRAM. That's with HUGE transistor sizes and very, very soft parts. We saw more errors than that during prototyping, but those were manifestations of bus timing, or timing conflicts that trashed bits.


Here's a presentation on SDRAMs in spaceflight applications:
http://klabs.org/richcontent/MAPLDCon02/presentations/session_p/p7_ladbury_s.ppt

There's some data on an array of twelve 256 Mbit SDRAMs, EDACed, with upset rates in geosynchronous orbit of 4E-17 per day, 99th percentile. That's with a raw upset rate of about 1/2 bit/day overall:


http://www.maxwell.com/microelectronics/support/presentations/ESCCON_2002.pdf


Heh... you want motherboards that work?  Different question :(

I've found to determine if the ECC stuff works is to look at the BIOS
ECC settings, but often that info seems to be missing from the
manuals.


Sadly, the bios ECC settings on a number of MB's appear to be busted in
some cases ...  well, ok, the bios setup of the ECC system appears to
be busted.  At least most will signal an MCE these

There are, also, nefarious memory parts that emulate the ECC bits. Which, of course, does absolutely no good.


Anyway, I was asking for a reason. :)


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
