Bogdan Costescu wrote:
> On Wed, 16 Jan 2008, Craig Tierney wrote:
>> Our queue limits are 8 hours.
>> ...
>> Did that sysadmin who set 24 hour time limits ever analyze the amount
>> of lost computational time because of larger time limits?
> While I agree with the idea and reasons of short job runtime limits, I
> disagree with your formulation. Being many times involved in discussions
> about what runtime limits should be set, I wouldn't make myself a
> statement like yours; I would say instead: YMMV. In other words: choose
> what fits better the job mix that users are actually running. If you
> have determined that 8h max. runtime is appropriate for _your_ cluster
> and increasing it to 24h would lead to a waste of computational time due
> to the reliability of _your_ cluster, then you've done your job well.
> But saying that everybody should use this limit is wrong.
First of all, I agree that it is always a YMMV case. We are good about that
here (on the list).
My point was that, in every instance I have seen, multi-day queue limits are
not the norm. Those places do have exceptions for particular codes and
particular projects. I know our system could handle 24h queues in terms of
reliability, but with the job mix we have, it would cause problems beyond
stability (we are currently looking at a new scheduler to solve that problem).
> Furthermore, although you mention that system-level checkpointing is
> associated with a performance hit, you seem to think that user-level
> checkpointing is a lot lighter, which is most often not the case.
There was an assumption in my statement that I didn't share: I was thinking
of the sort of system-level checkpointing that will probably work for
clusters, which will be some kind of VM-based solution. That would carry the
overhead of the virtual machine as well as the cost of moving the data when
the time comes.
> Apart from the obvious I/O limitations that could restrict saving &
> loading of checkpointing data, there are applications for which
> developers have chosen to not store certain data but recompute it every
> time it is needed because the effort of saving, storing & loading it is
> higher than the computational effort of recreating it - but this most
> likely means that for each restart of the application this data has to
> be recomputed.
Yes, but didn't you just say that recomputing that data is faster than the
I/O time associated with reading it? A checkpoint isn't model results; a
checkpoint is the state of the model at a particular time, so in this case
you would save that data. It's already in memory; you just need to write it
out along with every other bit of relevant information. No extra computation
is needed.
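
To be concrete about what I mean by user-level checkpointing, here is a
minimal sketch in C. The struct, sizes, and file name are made up for
illustration; a real model would have more to save, but the shape is the
same: dump the in-memory state at a timestep boundary, reload it on restart.

/* Minimal sketch of user-level checkpointing.  The state struct,
 * model size, and file name are hypothetical. */
#include <stdio.h>

#define N 1000000               /* hypothetical model size */

struct state {
    int    step;                /* current timestep */
    double field[N];            /* the in-memory model state */
};

static int save_checkpoint(const char *path, const struct state *s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    /* State is already in memory; just write it out. */
    size_t ok = fwrite(s, sizeof(*s), 1, f);
    return (fclose(f) == 0 && ok == 1) ? 0 : -1;
}

static int load_checkpoint(const char *path, struct state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;          /* no checkpoint found */
    size_t ok = fread(s, sizeof(*s), 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    static struct state s;      /* static: too big for the stack */

    if (load_checkpoint("model.ckpt", &s) != 0)
        s.step = 0;             /* no checkpoint: fresh start */

    for (; s.step < 100; s.step++) {
        /* ... advance the model one timestep ... */
        if (s.step % 10 == 0)   /* checkpoint every 10 steps */
            save_checkpoint("model.ckpt", &s);
    }
    return 0;
}

On restart the job re-reads the last checkpoint and loses at most the steps
since it was written; nothing has to be recomputed from scratch.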
> And smaller max. runtimes mean more restarts needed to reach the same
> total runtime...
Yes, any time you are doing something other than the model run (like
checkpointing), your run will take longer. This is another one of those "it
depends" scenarios. If the runtime is 1% longer, but it makes the other users
happier or lessens the loss from an eventual crash, is it worth it?
The 1% number is a target I would design for, based on the workload we
experience (a multitude of different-sized jobs, not one big job). I would
buy a couple of nodes with 3ware cards and run either Lustre or PVFS2 over
them as a place to dump the checkpoints. The filesystem would be mostly
volatile (so redundancy wouldn't be critical), and would more than meet the
reliability requirements of my system (>97%).
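
A quick back-of-envelope check of that 1% target (the numbers here are
purely illustrative, not measurements from our system): suppose a typical
job holds 4 GB of state per node and checkpoints every 2 hours. Staying
under 1% overhead means each checkpoint has to complete in about 72 seconds,
which works out to roughly 55-60 MB/s of sustained write bandwidth per node.
A couple of 3ware-backed nodes running Lustre or PVFS2 should manage that,
provided the scheduler keeps too many jobs from checkpointing simultaneously.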
Craig
--
Craig Tierney ([EMAIL PROTECTED])