Re: OS support for fault tolerance

Jason Hellenthal Tue, 14 Feb 2012 09:06:36 -0800


On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
> > For multicore desktop computers, suppose one of the cores fails, the
> > FreeBSD OS crashes. My question is about how I can make the OS tolerate
> > this hardware fault.
> > The strategy is to checkpoint the state of each core at specific intervals
> > of time in main memory. Once a core fails, its previous state is retrieved
> > from the main memory, and the processes that were running on it are
> > rescheduled on the remaining cores.
> >
> > I read that the OS tolerates faults in large servers. I need to make it do
> > this for a Desktop OS. I assume I would have to change the scheduler
> > program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
> > How do I go about doing this? What exactly do I need to save for the
> > "state" of the core? What else do I need to know?
> > I have absolutely no experience with kernel programming or with FreeBSD.
> > Any pointers to good sources about modifying the source-code of FreeBSD
> > would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by 'fails'? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
> 
> if it just "stops" then you might be able to have a watchdog that
> notices,  but what do you do when it was half way through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module and then you have to find out what it had
> done and clean up the mess.
> 
> This requires rewriting many, many parts of the kernel to remove
> 'transient inconsistent states'. And even then, what do you do if it
> was halfway through manipulating some hardware?
> 
> and when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say for example it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
> 
> but I'm interested in any answers people may have
>


How about core redundancy? This would effectively cut the number of
available cores in half if you spread a process to run on two cores at
the same time, but with an option to adjust this per process, etc. I
don't see it as infeasible.
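
As a rough user-space illustration of the redundancy idea (not kernel code,
and not something FreeBSD provides): the sketch below runs the same
computation in two forked processes and accepts the result only if both
replicas agree. The names start_replica, collect and run_redundant are made
up for this example, and it assumes the computation is deterministic and
returns something picklable. It does not pin the replicas to distinct cores.

```python
import os
import pickle


def start_replica(func, args):
    """Fork a child that computes func(*args); return (pid, read_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute, send the pickled result to the parent, then exit
        # immediately without running the parent's cleanup handlers.
        os.close(read_fd)
        try:
            with os.fdopen(write_fd, "wb") as out:
                out.write(pickle.dumps(func(*args)))
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def collect(pid, read_fd):
    """Return the replica's pickled result, or None if the replica failed."""
    with os.fdopen(read_fd, "rb") as inp:
        data = inp.read()
    _, status = os.waitpid(pid, 0)
    return data if status == 0 else None


def run_redundant(func, *args):
    """Run func(*args) in two processes at once and compare the results."""
    # Start both replicas before collecting so they can run concurrently.
    replicas = [start_replica(func, args) for _ in range(2)]
    results = [collect(pid, fd) for pid, fd in replicas]
    if None in results or results[0] != results[1]:
        raise RuntimeError("replicas disagree or one crashed")
    return pickle.loads(results[0])


if __name__ == "__main__":
    print(run_redundant(sum, [1, 2, 3]))
```

Two replicas can only detect a mismatch; deciding which one is right would
take a third replica and a majority vote. On FreeBSD each replica could in
principle be pinned to its own core with cpuset(1) or cpuset_setaffinity(2),
which is left out here.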

