----- "Greg Lindahl" <lind...@pbm.com> wrote:

> That kind of policy has a fairly high opportunity
> cost, even before you factor in linked nodes.

Well, we cannot dictate to our users what they do.
We set a maximum walltime of 3 months and tell users
that they should checkpoint (if they have control of
the application and the coding skills to do so).
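
For anyone wondering what application-level checkpointing
looks like in practice, here is a minimal sketch in Python.
It is only an illustration, not what any particular code on
our systems does: the function names (save_checkpoint,
load_checkpoint, run), the JSON file format and the toy
"work" in the loop are all assumptions made up for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic on POSIX filesystems, so a crash mid-write
    should leave the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100):
    """Toy long-running job: resumes from the last checkpoint if present."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state, whatever the interval
    return state
```

If the job dies, resubmitting it with the same checkpoint path
picks up from the last saved step rather than from zero, so at
most checkpoint_every steps of work are lost.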

> E.g. you see a system disk going bad, but the user
> will lose all their output unless the job runs for
> 4 more weeks...

We run SMART tests and the like to try to proactively
spot bad disks (and other hardware) before they fail,
but yes, cases like that are inevitable.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
