> On Fri, Jun 28, 2013 at 09:45:50AM +0100, Jonathan Barber wrote:
>
>> The problem with SSH based approaches is when you have failed nodes -
>> normally they cause the entire command to hang until the attempted
>> connection times out.
>
> Normally what people do is ping the node before trying ssh on it. And
> have reasonable timeouts around both the ssh connect and the command
> execution. There's no fundamental reason why this is any different
> from messaging or subscription-plus-messaging.
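
For anyone who wants to see the "ping first, then ssh with timeouts" idea spelled out, here is a minimal Python sketch. It is not taken from any existing tool. The function names, the status strings, and the default timeout values are all made up for illustration, and the ping flags assume Linux iputils (where -W is the reply wait in seconds). ConnectTimeout and BatchMode are standard OpenSSH client options.

```python
import subprocess


def ping_cmd(host, wait_s=1):
    # Assumes Linux iputils ping: -c 1 sends one probe, -W is the reply wait in seconds.
    return ["ping", "-c", "1", "-W", str(wait_s), host]


def ssh_cmd(host, command, connect_s=5):
    # BatchMode=yes stops a password prompt from blocking the run;
    # ConnectTimeout bounds how long the connection attempt may take.
    return ["ssh", "-o", "BatchMode=yes",
            "-o", f"ConnectTimeout={connect_s}", host, command]


def run_on_node(host, command, exec_s=30, runner=subprocess.run):
    """Return (status, output); status is 'down', 'timeout', 'failed' or 'ok'."""
    # Step 1: cheap liveness probe, so a dead node is skipped quickly.
    try:
        probe = runner(ping_cmd(host), capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return ("down", "")
    if probe.returncode != 0:
        return ("down", "")

    # Step 2: run the command; exec_s caps connect plus execution time overall.
    try:
        result = runner(ssh_cmd(host, command),
                        capture_output=True, text=True, timeout=exec_s)
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    if result.returncode != 0:
        return ("failed", result.stderr)
    return ("ok", result.stdout)
```

Calling run_on_node("node01", "uptime") for each host in a list (node01 is a placeholder name) gives one status per node, and a dead or hung node costs at most the ping wait or exec_s seconds instead of stalling the whole run.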

I have found that running whatsup-pingd
(https://computing.llnl.gov/linux/whatsup.html) once every minute or so,
to create a list of "up nodes" and "down nodes", is very handy. You can
even point pdsh's WCOLL at the up nodes file.

--
Doug

> -- greg

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf