On Thu, Jan 17, 2008 at 02:53:36PM +0100, Bogdan Costescu wrote:
> On Wed, 16 Jan 2008, Craig Tierney wrote:
>
> > Our queue limits are 8 hours.
> > ...
> > Did that sysadmin who set 24 hour time limits ever analyze the amount
> > of lost computational time because of larger time limits?
>
> While I agree with the idea and reasons of short job runtime limits, I
> disagree with your formulation. Being many times involved in
> discussions about what runtime limits should be set, I wouldn't make
> myself a statement like yours; I would say instead: YMMV. In other
> words: choose what fits better the job mix that users are actually
> running. If you have determined that 8h max. runtime is appropriate
> for _your_ cluster and increasing it to 24h would lead to a waste of
> computational time due to the reliability of _your_ cluster, then
> you've done your job well. But saying that everybody should use this
> limit is wrong.
Completely agree.

> Furthermore, although you mention that system-level checkpointing is
> associated with a performance hit, you seem to think that user-level
> checkpointing is a lot lighter, which is most often not the case.

Hmmm. A system-level checkpoint must save the complete state of the
process being checkpointed, plus all of its siblings/children, plus
varying amounts of external state; a machine-level checkpoint must save
the complete state of the machine(s). A user-level checkpoint need only
save the data that define the current state--that could well be a small
set of values.

Having written that, it may be *easier* (even cheaper) to expend the
resources to save the complete state than to restructure some suitably
complex code to expose a restart state. I certainly know an application
that fits that model during most of its runtime. But, at the end of the
day, that is just trading runtime for design/coding/validation time,
and the notion's validity depends on which side of the operation you
sit. Consider this, though: if as an admin you rely only on user-level
checkpointing, you *will* end up arguing with one or more users about
the maximum runtime at some point; with a system (or machine)
checkpoint, you'll likely avoid a lot of agida[1], especially when
unplanned or emergency outages/reprioritizations occur.

> Apart from the obvious I/O limitations that could restrict saving &
> loading of checkpointing data, there are applications for which
> developers have chosen to not store certain data but recompute it
> every time it is needed because the effort of saving, storing &
> loading it is higher than the computational effort of recreating it -
> but this most likely means that for each restart of the application
> this data has to be recomputed. And smaller max. runtimes mean more
> restarts needed to reach the same total runtime...

As you note, only the application can know that it's easier to
recompute than to save and restore.
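For anyone who hasn't built one, the user-level approach above can be
sketched in a few lines. This is only an illustration, not any
particular application's scheme: the checkpoint file name, the step
count, and the state variables are all made up, and the "work" is a
stand-in for a real iteration. The point is that the restart state is
just the handful of values that define where the computation is.

```python
import json
import os

CKPT = "solver.ckpt"  # hypothetical checkpoint file name

def load_state():
    # Restart from the last user-level checkpoint if one exists;
    # otherwise begin from the initial conditions.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}

def save_state(state):
    # Write to a temp file and rename, so a crash mid-write
    # can't leave a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
while state["step"] < 1000:
    state["total"] += state["step"]   # stand-in for real work
    state["step"] += 1
    if state["step"] % 100 == 0:      # checkpoint every 100 steps
        save_state(state)
```

Kill the process anywhere in the loop and rerun it: it resumes from the
last multiple of 100 rather than step 0, and the checkpoint is a
two-field JSON file instead of a full process image.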
I suspect many of us can cite specific examples where it's easier to
recompute; some could probably also cite cases where recomputing is
faster too...

[1] Heartburn, indigestion, general upset or agitation.

--
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf