Re: OS support for fault tolerance

Julian Elischer Tue, 14 Feb 2012 15:00:46 -0800

On 2/14/12 9:27 AM, Rayson Ho wrote:

On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer<[email protected]>  wrote:

but I'm interested in any answers people may have

The way other OSes handle this is by detecting an abnormal number of
faults (sometimes it's not the fault of the hardware, e.g. when a
particle from outer space hits a core and flips a bit), and then
disabling the core(s).
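
As a rough illustration of the "count faults, then disable the core" idea, here is a minimal sketch in Python. It is not how Solaris FMA or z/OS actually implement this; the class name FaultMonitor, the threshold, and the time window are all hypothetical, and "offline" here is just a set in memory rather than a real call into the scheduler.

```python
from collections import defaultdict, deque


class FaultMonitor:
    """Toy per-core fault counter; not modeled on any real OS interface."""

    def __init__(self, threshold=3, window=60.0):
        self.threshold = threshold        # faults tolerated inside one window
        self.window = window              # window length in seconds
        self.events = defaultdict(deque)  # core id -> timestamps of recent faults
        self.offline = set()              # cores this policy has retired

    def report_fault(self, core, now):
        """Record one fault; return True if this call takes the core offline."""
        if core in self.offline:
            return False
        recent = self.events[core]
        recent.append(now)
        # Drop faults that have aged out of the window, so an isolated
        # bit flip every now and then never retires a core.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.threshold:
            self.offline.add(core)
            return True
        return False
```

The sliding window is what separates an occasional transient fault from a core that is faulting persistently. A policy like this only helps with faults that actually get reported.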


Solaris & mainframe (z/OS) handle it this way, but you should google
and find more info since I don't remember all the details.

Also, see this presentation: "Getting to know the Solaris Fault
Management Architecture (FMA)":
http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf

True, but you can't guarantee that a CPU is going to fail in a way
that you can detect like that. What if the clock just stops? I believe
that even those systems that support CPU deactivation on error only
catch some percentage of the problems, and that sometimes it was more
of "bring up the system without CPU X after it all crashed in flames".

Tandem and other systems in the old days used to be able to cope with
dying CPUs pretty well, but they had support from top to bottom and the
software was written with 'clustering' in mind.

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[email protected]"

