On 04/03/16 06:40, Douglas Eadline wrote:

> Yes, failure needs to be option.

The Slurm folks have been working on failure management support for a little while, the idea being you can have a pool of spare nodes to pick from (or alternatively bargain with a scheduler for a node that's currently busy to come free later on and then add it to the job, potentially extending the walltime to make up for the shortfall).

A better description from someone with higher caffeination is here:

http://slurm.schedmd.com/nonstop.html

All the best,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
