My scheduler, Torque, flags compute nodes as "busy" when the load rises above a threshold called the "ideal load". On my 8-core compute nodes ideal_load is set to 8, but I am wondering whether that is appropriate.
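
To make the comparison concrete, here is a minimal Python sketch of how I read a load figure against the two thresholds. The function names classify_load and per_core_load are made up for illustration, and nothing here claims to reproduce how Torque's pbs_mom decides internally. The defaults 8.0 and 9.0 are my ideal_load and max_load settings, and 8 is my core count.

```python
import os


def classify_load(load, ideal_load=8.0, max_load=9.0):
    """Place a load average relative to the two configured thresholds.

    Illustrative only: the labels are my own and are not Torque node states.
    """
    if load > max_load:
        return "above max_load"
    if load > ideal_load:
        return "between ideal_load and max_load"
    return "at or below ideal_load"


def per_core_load(load, cores=8):
    """Normalise a load average by core count (1.0 means one runnable task per core)."""
    return load / cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    for label, load in zip(("1 min", "5 min", "15 min"), os.getloadavg()):
        print(f"{label}: load={load:.2f} "
              f"per-core={per_core_load(load):.2f} "
              f"-> {classify_load(load)}")
```

With my settings, a node sitting at a load of 9.5 comes out as "above max_load", and one at 8.5 falls in the band between the two thresholds.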

$max_load 9.0
$ideal_load 8.0

I do understand the "ideal load = number of cores" heuristic, but in at least 30% of our jobs (if not more) I find the load average greater than 8, sometimes even in the 9-10 range. Does this mean something is wrong, or should I take it as the "happy" scenario for HPC, i.e. not only are all CPUs busy, but the queue of processes waiting for their CPU slice is also relatively full? After all, an "under-loaded" HPC node is a waste of an expensive resource.

On the other hand, if there truly were something wrong with a node[*] and I were to use a high load average as one of the signs of impending trouble, what would be a good threshold? Above what load average on a compute node do people actually get worried? It would make sense to set PBS's default "busy" warning to that limit instead of just "8".

I'm ignoring the 1/5/15 min load average distinction. I'm assuming Torque is using the most appropriate one!

[*] e.g. a runaway process, an infinite loop in user code, multiple jobs accidentally assigned to the same node, etc.

--
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf