On May 13, 2008, at 6:17 AM, Bogdan Costescu wrote:

On Mon, 12 May 2008, Glen Beane wrote:

I know TORQUE USED to be much better than SGE at controlling MPI type jobs.

I think it still is, since the long-awaited TM support in SGE still does not exist.

If you use a PBS/TORQUE-aware MPI job launcher, it is pretty much impossible for any of the job processes to escape the control of the batch system.

Hmm, not quite true. I've recently had several such instances where I had to kill individual processes by hand (using Torque 2.1.10). One nice thing about SGE is its use of setgroups() to set additional groups from a reserved range on all the processes of a job; as this call is normally only available to "root", it's impossible for user processes to modify the additional groups list and escape being killed. I used SGE in the past and don't remember ever having to clean up processes by hand.
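To make the setgroups() idea concrete, here is a rough Python sketch of how a cleanup step could find every process that carries a job's tag group. This is not SGE's actual code. It assumes a Linux-style /proc where each <pid>/status file has a "Groups:" line listing the supplementary GIDs, and the function names and the example GID 20007 are made up for illustration.

```python
import os


def parse_groups(status_text):
    """Return the set of supplementary GIDs found in /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # The line looks like "Groups:\t100 20007"; skip the label.
            return {int(g) for g in line.split()[1:]}
    return set()


def pids_with_group(gid, proc_root="/proc"):
    """List PIDs whose supplementary groups include gid (hypothetical helper)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                if gid in parse_groups(f.read()):
                    pids.append(int(entry))
        except OSError:
            continue  # process exited (or is unreadable) while scanning
    return sorted(pids)
```

With the job's reserved GID in hand (say the hypothetical 20007), something like pids_with_group(20007) would give the list of processes to signal, no matter how they were started or whether they detached from their parent. Since an unprivileged process cannot call setgroups() to drop the tag, it cannot hide from such a scan.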

[ Please note that I'm taking here into consideration only the batch system proper and not any kind of prologue/epilogue scripts which are the usual fixes that are applied locally. IMHO job cleanup is a basic functionality that should be included in the batch system proper. ]

In TORQUE I've never had a problem with TM-spawned processes not getting cleaned up, and there is no universal way for TORQUE to know about and clean up anything spawned outside of TM (such as with an ssh-based MPI job launcher). If TM-spawned processes were not getting cleaned up I would log it as a bug.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
