Re: [Beowulf] ECC support on motherboards?

Jim Lux Tue, 13 May 2008 15:15:31 -0700

At 02:16 PM 5/13/2008, Håkon Bugge wrote:

At 19:17 13.05.2008, Perry E. Metzger wrote:

So another question is, how can you reliably test any of this stuff?
It isn't like you can reliably induce single bit errors and see if the
hardware catches them. (A special memory module that let you test
would be a wonderful thing, but I've never even heard of such a thing.)

We in the space hardware business do this all the time (but, then, we're not building consumer-priced stuff, by any means). Typically, what it means is that you provide a way to bypass the EDAC logic on writes or reads (which implies that the syndrome bits need some way to be addressed). You can write without EDAC, and then read with, or vice versa.
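
To make the write-without, read-with idea concrete, here is a toy software model. It is only a sketch under loud assumptions: a 4-bit word protected by Hamming(7,4) plus an overall parity bit stands in for the real (much wider) EDAC code, and the names Memory, raw_read, raw_write and inject are made up for illustration, not any real part's interface.

```python
# Toy model of EDAC test access. Assumption: a 4-bit data word protected
# by Hamming(7,4) plus an overall parity bit (SECDED). Real hardware uses
# wider codes; every name below is hypothetical.

def encode(nibble):
    """Return an 8-element codeword [p0, c1..c7] for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                       # c[1..7] are the Hamming positions
    c[3], c[5], c[6], c[7] = d        # data bits
    c[1] = c[3] ^ c[5] ^ c[7]         # check bits (the "syndrome bits")
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) & 1             # overall parity over positions 1..7
    return c

def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    overall = sum(c) & 1              # 0 when total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1              # single-bit error (syndrome 0 means p0)
        status = "corrected"
    else:
        status = "uncorrectable"      # nonzero syndrome, even parity: two bits
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status

class Memory:
    """Memory with a normal EDAC path and a diagnostic bypass path."""
    def __init__(self, size):
        self.cells = [encode(0) for _ in range(size)]
    def write(self, addr, nibble):        # normal path, through EDAC
        self.cells[addr] = encode(nibble)
    def read(self, addr):                 # normal path, through EDAC
        return decode(self.cells[addr])
    def raw_read(self, addr):             # bypass: data and check bits as stored
        return list(self.cells[addr])
    def raw_write(self, addr, codeword):  # bypass: store bits without encoding
        self.cells[addr] = list(codeword)

def inject(mem, addr, positions):
    """Flip the given codeword positions using only the bypass path."""
    word = mem.raw_read(addr)
    for p in positions:
        word[p] ^= 1
    mem.raw_write(addr, word)
```

Writing a known value through the normal path, flipping one bit via inject, and then reading through the normal path should report a corrected read; flipping two bits should report an uncorrectable one. If the read comes back "ok" with wrong data, the EDAC path is not doing its job.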


We've also used dual port memories, where the second port is for diagnostics.

Another approach is error injection at the data lines (there are logic analyzers that can do this).

A bigger issue for most computers is upsets of configuration control bits of one sort or another. Unlike program and data memory, which is often being overwritten regularly anyway, configuration bits tend to get set once at startup/initialization, and then never changed. A particular problem if the bit controls whether a pin on a device is an input or output.

Well, you can trust the HW vs. the firmware. Further, for some chipsets it is possible to simply stop the memory refresh for some time (~1 minute) while the system is idle. After this, you enable it again, and you should see single and/or double bit errors. The enabling/disabling can be done through setpci or similar. If you do not see errors after this, you can try to explain why...

Maybe, maybe not. I wouldn't want to depend on the non-refreshed behavior of a refreshed part, simply because it's undefined. Not refreshing might lead to bit errors, it might not (maybe it's internally refreshed, maybe it's MRAM or static RAM masquerading as DRAM).
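
Whichever way such an experiment goes, you need some way to see whether the hardware reported anything. On a Linux box with an EDAC driver loaded for the memory controller, the corrected and uncorrected error counts are exposed under sysfs, and a rough sketch of snapshotting them before and after a test might look like the following. Assumptions: the usual /sys/devices/system/edac/mc layout with ce_count and ue_count files; the function names are made up for illustration, and nothing here touches the refresh setting itself.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'ce_count': int|None, 'ue_count': int|None}}.

    Assumes the Linux EDAC sysfs layout. An empty dict means no memory
    controller is registered (for example, no EDAC driver is loaded).
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None    # file missing or unreadable
        counts[mc.name] = entry
    return counts

def new_errors(before, after):
    """Return per-controller increases between two snapshots."""
    delta = {}
    for mc, entry in after.items():
        for name, value in entry.items():
            old = before.get(mc, {}).get(name)
            if value is not None and old is not None and value > old:
                delta.setdefault(mc, {})[name] = value - old
    return delta
```

Take one snapshot with read_edac_counts() before the test and one after, then pass both to new_errors(). An empty result after a test that should have produced errors is exactly the "try to explain why" case.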

Once I wrote a tool which examined all settings of a particular chipset. That raised numerous questions for the vendor.



Hakon

I'm doing the planning for a new cluster and the whole thing is
remarkably bothersome. You can't easily figure out what motherboards
will even pretend to do ECC that easily, you can't easily check once
you have a sample motherboard in hand. It isn't even easy to get ECC
memory for more modern standards. I'm starting to wonder if doing all
calculations twice, once on each of two machines, isn't easier, but it
seems utterly wrong to do that...

Perry


--
Håkon Bugge
CTO
mob. +47 92 48 45 14
off. +47 92 44 81 11
fax. +47 22 23 36 66
[EMAIL PROTECTED]
Skype: hakon_bugge

Scali - http://www.scali.com
Higher Performance Computing


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org

To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf




