On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar <rpna...@gmail.com> wrote:
> I wanted to get some opinions about if watchdog timers are a good idea
> or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: If the system hangs then get it
> to reboot after, say, 5 minutes automatically. But, in practice, maybe
> it is a terrible idea.
>
> Of course, one might say, a well configured HPC compute-node
> shouldn't be getting to a hung point anyways; but in-practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> Some BIOS's have a setting for this, times to reboot before quitting.
> The danger, seems to me: What if a node kept crashing (due to say, a
> bad HDD or something). Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then watchdog gives up?
>

You could also do something at the system level to prevent it. If the
system boots and the previous_uptime is less than one hour, shut down
the system. The WD timer will not wake it up.

>
> If one had to do the watchdogging should one do the resets locally
> using the IPMI local interface (hogs cpu cycles) or a central
> Nagios-like system that could issue such a command. Many scenarios
> seem possible. The prospect of a automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
>

Also, almost all systems that can do this send out a page and an
email on the event, so someone will know about it.

Ed

> --
> Rahul
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
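For what it's worth, the previous-uptime check suggested above could be sketched roughly as follows. This is only an illustration under some assumptions: the state file path is whatever you choose, the function names (read_previous_uptime, record_uptime, should_stay_down) are made up for this example, and the node is assumed to be Linux so that /proc/uptime is available. A cron job or small daemon would call record_uptime() every few minutes, and an early boot script would call read_previous_uptime() before that recorder starts overwriting the file.

```python
import os
import tempfile

# Threshold from the suggestion above: a previous boot shorter than one
# hour is treated as a sign that the node is crash-looping.
MIN_UPTIME_SECONDS = 3600


def current_uptime():
    """Seconds since boot, read from /proc/uptime (Linux only)."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])


def read_previous_uptime(state_path):
    """Return the last recorded uptime in seconds, or None if there is no usable record."""
    try:
        with open(state_path) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def record_uptime(state_path, uptime_seconds):
    """Store the uptime via a temp file and rename, so a crash never leaves a half-written record."""
    directory = os.path.dirname(state_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{uptime_seconds:.0f}\n")
    os.replace(tmp_path, state_path)


def should_stay_down(previous_uptime, minimum=MIN_UPTIME_SECONDS):
    """True when the previous boot lasted less than `minimum` seconds."""
    return previous_uptime is not None and previous_uptime < minimum
```

At boot, if should_stay_down(read_previous_uptime(path)) is true, the script would power the node off (for example by invoking the system's shutdown command) instead of letting it rejoin the cluster. Someone would then need to remove or reset the state file after fixing the node, otherwise it will power off again on the next manual start.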