On May 13, 2008, at 6:17 AM, Bogdan Costescu wrote:
On Mon, 12 May 2008, Glen Beane wrote:
I know TORQUE USED to be much better than SGE at controlling MPI
type jobs.
I think that it still is, due to the long-awaited but still not
existing TM support in SGE.
If you use a PBS/TORQUE aware MPI job launcher it is pretty much
impossible for any of the job processes to escape control of the
batch system.
Hmm, not quite true. Just recently I've had several such instances
where I had to kill individual processes by hand (using Torque
2.1.10). One nice thing about SGE is its use of setgroups() to set
additional groups from a reserved range on all the processes of
a job. As this call is normally only available to "root", it's
impossible for user processes to modify the additional groups list
and escape being killed. I used SGE in the past and don't remember
ever having to clean up processes by hand.
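
To illustrate the additional-group idea described above, here is a minimal
sketch (not SGE's actual code) of how a cleanup tool could find every process
tagged with a job's reserved group. It assumes Linux, where
/proc/<pid>/status carries a "Groups:" line listing the supplementary GIDs.
The function names and the example GID 20005 are hypothetical.

```python
import os


def supplementary_groups(status_text):
    """Return the supplementary GIDs from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # The line looks like "Groups:\t100 20005"; skip the label.
            return {int(tok) for tok in line.split()[1:]}
    return set()


def pids_with_group(tag_gid, proc_root="/proc"):
    """List PIDs whose supplementary groups include tag_gid.

    tag_gid stands in for the per-job GID that the batch system would have
    assigned from its reserved range (an assumption for this sketch).
    """
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no procfs here, nothing to report
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        if tag_gid in supplementary_groups(text):
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    # Hypothetical job GID; a real cleanup would then signal these PIDs.
    print(pids_with_group(20005))
```

Because an unprivileged process cannot drop the extra group, a scan like this
would still find it even after it has detached from its parent or started a
new session.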
[ Please note that I'm considering here only the
batch system proper and not any kind of prologue/epilogue scripts,
which are the usual fixes that are applied locally. IMHO job
cleanup is a basic functionality that should be included in the
batch system proper. ]
In TORQUE I've never had a problem with TM spawned processes not
getting cleaned up, and there is no universal way for TORQUE to know
about and clean up anything spawned outside of TM (such as with an ssh
based MPI job launcher). If TM-spawned processes were not getting
cleaned up I would log it as a bug.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf