Hi Rahul, the same thing happens on our side. A node gets rebooted by ASR, and it doesn't crash. Can you suggest any remedy?
On Fri, Oct 23, 2009 at 6:26 AM, Rahul Nabar <rpna...@gmail.com> wrote:
> I wanted to get some opinions about whether watchdog timers are a good
> idea or not. I came across watchdogs again when reading through my IPMI
> manual. In principle it sounds neat: if the system hangs, then get it
> to reboot automatically after, say, 5 minutes. But in practice, maybe
> it is a terrible idea.
>
> Of course, one might say a well-configured HPC compute node
> shouldn't be getting to a hung point anyway; but in practice I see a
> few nodes every month that can be resurrected by a simple reboot.
> Admittedly these nodes are quite senile.
>
> The danger, it seems to me: what if a node kept crashing (due to, say, a
> bad HDD or something)? Then a watchdog would merely keep rebooting
> this node a hundred times. Not such a good thing.
>
> Have you guys used watchdog timers? Maybe there is a way to build a
> circuit-breaker around the principle so that if a node reboots
> automatically more than 3 times then the watchdog gives up?
>
> If one had to do the watchdogging, should one do the resets locally
> using the IPMI local interface (hogs CPU cycles) or through a central
> Nagios-like system that could issue such a command? Many scenarios
> seem possible. The prospect of an automated system doing a reboot at
> 3am seems more tempting than me having to do this manually.
>
> --
> Rahul
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
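
The circuit-breaker idea in the quoted message could be sketched roughly as below. This is only an illustration, not something from the IPMI manual: the function names, the one-hour window, and the JSON state file are all assumptions, and the sketch counts every boot because telling a watchdog reset apart from a manual reboot is hardware-specific. A boot-time script would call record_boot() and arm the watchdog only if it returns True.

```python
import json
import time
from pathlib import Path

MAX_REBOOTS = 3        # from the quoted message: give up after more than 3 reboots
WINDOW_SECONDS = 3600  # assumption: only boots within the last hour are counted


def should_arm_watchdog(reboot_times, now,
                        max_reboots=MAX_REBOOTS, window=WINDOW_SECONDS):
    """Return True if the watchdog may be armed, False if the breaker has tripped."""
    recent = [t for t in reboot_times if now - t <= window]
    return len(recent) <= max_reboots


def record_boot(state_file, now=None):
    """Record this boot in a persistent state file (hypothetical path chosen by
    the caller) and return the arm / do-not-arm decision."""
    now = time.time() if now is None else now
    path = Path(state_file)
    times = json.loads(path.read_text()) if path.exists() else []
    times.append(now)
    # Drop entries outside the window so the file does not grow forever.
    times = [t for t in times if now - t <= WINDOW_SECONDS]
    path.write_text(json.dumps(times))
    return should_arm_watchdog(times, now)
```

The state file has to live on storage that survives a reboot. If the failing component is the local disk itself, keeping the count on a central server is probably the safer variant.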