On Thu, Mar 1, 2012 at 2:20 AM, Jan Wender <j.wen...@science-computing.de> wrote:
> AFAIK at least LSF has this as a feature called preemption.
IMO, LSF has the best job preemption & checkpointing support, with the
least integration effort needed from the end user & cluster
administrator. Resource preemption and license preemption are among
the more advanced features of LSF. (More manual configuration is
needed for Grid Engine & Open Grid Scheduler and/or other batch
systems - not impossible, but it takes knowledge of how to tune the
scheduler.)

> That depends probably mostly on the application. If the application offers
> it, then the batch system can use it to save state.
> I don't know much about kernel level checkpointing, though.

There are 3 types of checkpointing supported by LSF:

1) kernel-level
2) user-level
3) application-level

Kernel-level is easy: the OS kernel handles everything for the user
(for interactive processes) & the batch system (for jobs). However,
only IRIX, Cray UNICOS, and NEC SUPER-UX support kernel-level
checkpointing.

On Linux, you usually need to patch the kernel:

- "Checkpoint/restart: it's complicated": http://lwn.net/Articles/414264/
- "Kernel-based checkpoint and restart": http://lwn.net/Articles/293575/

(There have been lots of discussions on kernel-level checkpointing in
the past few years, but we still don't have anything in the official
tree yet...)
Or even kernel-assisted user-level checkpointing:

- "Preparing for user-space checkpoint/restore": http://lwn.net/Articles/478111/

And there is also the famous Berkeley Lab Checkpoint/Restart (BLCR),
which is a kernel module, so you can keep using your distribution's
stock kernel:

- "RCE 12: BLCR": http://www.rce-cast.com/Podcast/rce-12-blcr.html
- "Checkpointing under Linux with Berkeley Lab Checkpoint/Restart":
  http://gridscheduler.sourceforge.net/howto/APSTC-TB-2004-005.pdf

For user-level, you will need to link against a checkpointing library
shipped with LSF, which (I think) has some object-file-level init
routines that perform the initialization needed to properly save the
process state, and which also needs to wrap standard libc functions &
system calls (I forgot the actual details; lots of academic papers
were published 15 years ago and I recall reading a few of them, but I
just don't recall the content :-D ). See "Standalone Checkpointing":

http://research.cs.wisc.edu/condor/checkpointing.html

With user-level checkpointing & restart, you usually need to relink
your application (unless you use the LD_PRELOAD trick). So for
operating systems that don't support kernel-level checkpointing (i.e.
most of the OSes), user-level checkpointing usually works for most
general applications (I *think* Platform Computing even ported the LSF
checkpointing library to Windows as well - or at least that's what I
was told).

For application-level checkpointing, the application handles
everything. But of course each application needs to have its own
built-in support for checkpoint & restart.

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

>
> Cheerio, Jan

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
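To make the application-level case above a bit more concrete, here is
a minimal sketch in Python of an application that handles its own
checkpoint & restart. Everything in it is an assumption made for
illustration: the function names (save_checkpoint, load_checkpoint,
run), the JSON state file, and the stop_after parameter that simulates
a preemption are all hypothetical, and none of it is LSF's (or any
other batch system's) actual checkpoint interface.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on POSIX, so a crash leaves the old file intact.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n_steps, every=100, stop_after=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    `stop_after` (hypothetical) simulates the job being preempted after
    that many steps of work in this invocation.
    """
    state = load_checkpoint(path)  # resume if a checkpoint is present
    done = 0
    while state["next_i"] < n_steps:
        if stop_after is not None and done >= stop_after:
            return None  # "preempted": work since the last checkpoint is lost
        state["total"] += state["next_i"]
        state["next_i"] += 1
        done += 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

The point of the design is that a restart is just a rerun of the same
command: the application finds its own checkpoint file and continues
from the last saved step, redoing only the work done since then.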