Interesting. We (and by "we" I mean my time at the UC Berkeley College of Chemistry) used to implement multiple queues with various time restrictions to accommodate short, medium, long, and extended-run jobs. It was an honor system, to be sure, but I spent a great deal of time working with the researchers on an individual level to foster the trust that an honor system needs. There was also a little logic to allow submitted jobs to skew toward one end of the spectrum if the cluster was not fully utilized, and not expected to be. Working that closely with folks also allowed us to chart cluster usage for about a month (and sometimes much more), so we could tweak cluster policy if appropriate.
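Purely as an illustration (not our actual configuration -- the queue names, hour limits, and thresholds below are invented), the tiering plus the "let jobs skew longer when the cluster is quiet" rule amounted to something like:

# Hypothetical sketch of tiered walltime limits and the "skew when idle" rule.
# Queue names, hour limits, and the 50% threshold are invented for illustration.

QUEUE_LIMITS_HOURS = {
    "short": 4,
    "medium": 24,
    "long": 72,
    "extended": 168,
}

def allowed_walltime(queue, current_utilization, expected_utilization):
    """Walltime cap (hours) for a job submitted to `queue`.

    If the cluster is lightly used now and not expected to fill up,
    let the job skew one tier longer than its nominal limit.
    """
    tiers = list(QUEUE_LIMITS_HOURS)
    if current_utilization < 0.5 and expected_utilization < 0.5:
        bumped = min(tiers.index(queue) + 1, len(tiers) - 1)
        return QUEUE_LIMITS_HOURS[tiers[bumped]]
    return QUEUE_LIMITS_HOURS[queue]

# e.g. a "medium" job on a half-idle cluster gets the "long" cap of 72 hours
print(allowed_walltime("medium", current_utilization=0.3, expected_utilization=0.4))

In practice this lived in scheduler policy rather than a script, but the idea is the same.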


It worked out for the most part, but there was the occasional scofflaw. With the trust relationship I had with the researchers, we could usually nag the scofflaws back into line.

Layer 8 issues can certainly lead to trouble, but they can also be used to your advantage!

Just a personal observation. I realize this kind of thing would not work everywhere.

-geoff


PS: Sorry for any duplicate copies of this email; I am having some ISP issues this week.




On 16.01.2008 at 18:16, Craig Tierney <[EMAIL PROTECTED]> wrote:

Geoff wrote:


..Interesting discussion deleted..

As a funny aside, I once knew a sysadmin who applied 24-hour time limits to all queues on all clusters he managed in order to force researchers to think about checkpoints and smart restarts. I couldn't understand why so many folks from his particular unit kept asking me about arrays inside the scheduler submission scripts and nested commands until I found that out. Unfortunately, I came to the conclusion that folks in his unit were spending more time writing job submission scripts than code... well... maybe that is an exaggeration.


Our queue limits are 8 hours.  They are set this way for two reasons.
First, we have real-time jobs that need to get through the queues and
we believe that allowing significantly longer jobs would block those
really important jobs.  Second, for a multi-user system, it isn't very
fair for a user to run multi-day jobs and prevent shorter jobs from getting
in.  It is about being fair.  Use the resource and then get back in line.

I know that at other US Government facilities it is common practice to
set sub-day queue limits. I recently helped set up one site that had
queue limits set at 12 hours.  Another large organization near the top
of the Top500 list does this as well.

This means that codes need check-pointing.  Although we are all waiting
for the holy grail of system-level check-pointing, the odds of it being
implemented consistently across architectures AND without a significant
performance hit are slim.  This means that researchers also have to be
software engineers. If they want to get real work done, adding check-pointing
is one of the steps. As one operations manager at a major HPC site once said
to me, 'codes that don't support check-pointing aren't real codes'.
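For what it's worth, the application-level version doesn't have to be fancy. A minimal sketch of the save/restore pattern (the file name, interval, and state layout are just placeholders, not any particular site's code):

import os
import pickle

CHECKPOINT_FILE = "state.chk"   # placeholder file name
CHECKPOINT_EVERY = 100          # steps between saves, purely illustrative
TOTAL_STEPS = 10000

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "value": 0.0}

def save_checkpoint(state):
    """Write to a temp file and rename, so a mid-write kill can't corrupt it."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

state = load_checkpoint()
while state["step"] < TOTAL_STEPS:
    state["value"] += 1.0               # stand-in for the real computation
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save_checkpoint(state)

When the queue's 8-hour limit kills the job, resubmitting it simply picks up from the last checkpoint.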

Allowing users to run for days or weeks as SOP is begging for failure.
Did that sysadmin who set 24-hour time limits ever analyze the amount
of computational time lost because of larger time limits?

Craig




--
-------------------------------
Geoff Galitz, [EMAIL PROTECTED]
Blankenheim, Deutschland
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
