Re: [Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's)

Matt Allen Wed, 03 Jan 2007 09:34:23 -0800

Mark Hahn wrote:
> personally, I'm pretty convinced that MPI implementations should stay
> out of the jobstarter business, and go with straight agentless (ssh-based)
> job spawning.


I'm curious about your reasoning, Mark.  We've had nightmare situations
for years with ssh-based job spawning.  The most common case is where
sshd processes terminate on nodes without the child MPI processes
exiting.  Then we have orphaned MPI processes, owned by init, scattered
throughout the cluster.  If any of these processes are using limited
resources (like Myrinet adapters), subsequent jobs can (more likely,
will) exit immediately upon dispatch to the node.

We've found ways around this with prolog/epilog scripts and scheduling
policy, but the slickest solutions so far, in my opinion, have been
mpiexec (admittedly not part of an MPI implementation) and LAM/Open MPI.
Allowing the resource manager to completely handle job spawning has
provided better post-job cleanup and more complete job statistics
(cpu-time, mostly) for us.
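
For illustration only, here is a minimal sketch of the kind of check an
epilog script might run to spot the orphans described above.  It is not
the script used at any particular site.  The names Proc, find_orphans
and read_procs are hypothetical, and it assumes a Linux-style /proc
where a process that lost its parent shows a parent PID of 1 (init).

```python
import os
from typing import Iterable, List, NamedTuple


class Proc(NamedTuple):
    """One row of a process-table snapshot (hypothetical helper type)."""
    pid: int
    ppid: int
    uid: int
    comm: str


def find_orphans(procs: Iterable[Proc], job_uid: int,
                 init_pid: int = 1) -> List[Proc]:
    """Return processes owned by job_uid that have been reparented to init."""
    return [p for p in procs if p.uid == job_uid and p.ppid == init_pid]


def read_procs(proc_root: str = "/proc") -> List[Proc]:
    """Snapshot the process table, assuming the Linux /proc/<pid>/stat layout."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is wrapped in parentheses and may contain
        # spaces, so split on the last ")" rather than on whitespace.
        lpar, rpar = stat.index("("), stat.rindex(")")
        comm = stat[lpar + 1:rpar]
        fields = stat[rpar + 2:].split()  # fields[0] = state, fields[1] = ppid
        procs.append(Proc(int(entry), int(fields[1]), uid, comm))
    return procs
```

An epilog could call find_orphans(read_procs(), job_uid) with the uid of
the user whose job just finished, then log or signal whatever comes
back.  This only makes sense on nodes where a user should have nothing
running outside a job; anything a user legitimately daemonized would
also be reported.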

Do you not have to deal with these sorts of issues?  If not, lay some
wisdom on me; I could use it.

Matt

-- 
Matt Allen            |  Systems Analyst
[EMAIL PROTECTED]  |  Research and Technical Services
812-855-7318          |  Indiana University



