Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

Gerry Creager Fri, 23 Oct 2009 13:49:56 -0700

Greg Lindahl wrote:

On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote:

2. Some errors are hardware precipitated. Aging, out-of-warranty
hardware can sometimes need such a reboot compromise for
one-off random errors.

Maybe all the "nice" clusters out there never have this issue but for
me it is fairly common. Just confessing.


Why, exactly, are you assuming that your freezes are one-off random
errors due to aging hardware? Sounds like you're either guessing, or
you _are_ doing forensics, but aren't calling it forensics.

*MY* aging hardware usually just falls over dead when it's done with its
useful life. Too many intermittent errors/failures cause me to do
sufficient diagnostics to repair the node (if it's cheap and easy
enough) or drop it in the latest surplus run.

--
Gerry Creager
AATLT, Texas A&M University     Tel: 979.862.3982
1700 Research Pkwy, Ste 160     Fax: 979.862.3983
College Station, TX             Cell 979.229.5301
   77843-3139         http://mesonet.tamu.edu
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
