Gus' numbers make sense to me. I assume his workload consists of jobs of several sizes: serial, modestly parallel, and parallel jobs using all resources. Without pre-emptive scheduling, the batch queue system has to starve the system in order to run the larger jobs.
unless backfill can utilize those temporarily idle cpus.
Obviously, before a job that consumes all resources can start, all resources have to be idle. That means no jobs can be scheduled on those resources in the meantime, even though they're idle.
true enough, but it does depend on the size of the large, high-prio jobs relative to the size of the cluster.
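To put rough numbers on the drain cost and on what backfill might win back, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the 768-core machine split into three running jobs, and the candidate jobs are all made up, and the greedy placement is a toy, not how any particular scheduler does backfill.

```python
def drain_idle_core_hours(running):
    """Core-hours left idle while the machine drains for a full-size job.

    running: list of (cores, remaining_hours) for jobs currently executing.
    Assumes nothing new may start until the last running job has finished.
    """
    if not running:
        return 0.0
    drain_time = max(rem for _, rem in running)
    # Cores freed by a job sit idle from its end until the drain completes.
    return sum(cores * (drain_time - rem) for cores, rem in running)


def backfill_recovered(running, candidates):
    """Core-hours a toy greedy backfill could reclaim during the drain.

    candidates: list of (cores, hours) for queued jobs with known run limits.
    Simplification: a candidate starts the moment a slot opens and must
    finish before the drain completes, so the big job is never delayed.
    """
    if not running:
        return 0.0
    drain_time = max(rem for _, rem in running)
    # Each finishing job opens a slot: [free cores, hours until drain ends].
    slots = [[cores, drain_time - rem] for cores, rem in running]
    recovered = 0.0
    # Try the largest candidates (by core-hours) first.
    for c_cores, c_hours in sorted(candidates, key=lambda c: c[0] * c[1], reverse=True):
        for slot in slots:
            if c_cores <= slot[0] and c_hours <= slot[1]:
                slot[0] -= c_cores
                recovered += c_cores * c_hours
                break
    return recovered


# Hypothetical state: three 256-core jobs with 10, 4 and 1 hours left.
running = [(256, 10.0), (256, 4.0), (256, 1.0)]
# Hypothetical queue: the 12-hour job cannot fit in any drain window.
candidates = [(128, 5.0), (64, 8.0), (256, 12.0)]

idle = drain_idle_core_hours(running)               # 3840.0 core-hours idle
recovered = backfill_recovered(running, candidates)  # 1152.0 core-hours reclaimed
```

With these made-up numbers backfill gets back less than a third of the drained core-hours, and how much it recovers depends entirely on having short, small jobs with honest run limits in the queue.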
Another interesting metric is of course how many of the jobs run to successful completion, i.e., are not killed due to resource limits, crashes, or other reasons. That's what I call net vs. gross utilization.
surely this survival rate is quite high, no? again, it depends largely on the design of the cluster (I see few node crashes, maybe 1 of 768 nodes per week, and few resource crashes (perhaps a couple buggy jobs per week))
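One way to read the net vs. gross distinction, sketched under my own assumptions: gross counts every core-hour a job consumed, net counts only the core-hours of jobs that finished successfully. The function name, the record layout and the sample week below are hypothetical, not taken from any real accounting log.

```python
def utilization(jobs, total_cores, period_hours):
    """Return (gross, net) utilization as fractions of cluster capacity.

    jobs: list of (cores, hours, completed_ok) accounting records.
    gross counts all consumed core-hours; net counts only jobs that
    ran to successful completion (an assumed reading of "net").
    """
    capacity = total_cores * period_hours
    gross_ch = sum(cores * hours for cores, hours, _ in jobs)
    net_ch = sum(cores * hours for cores, hours, ok in jobs if ok)
    return gross_ch / capacity, net_ch / capacity


# Hypothetical week on a 768-core cluster; the last job was killed.
jobs = [
    (768, 24.0, True),
    (128, 100.0, True),
    (64, 50.0, False),
]
gross, net = utilization(jobs, total_cores=768, period_hours=168)
```

The gap between the two numbers is the work that was paid for but thrown away, so a high survival rate shows up as net being close to gross.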