Mark Hahn wrote:
> personally, I'm pretty convinced that MPI implementations should stay
> out of the jobstarter business, and go with straight agentless (ssh-based)
> job spawning.

I'm curious about your reasoning, Mark. We've had nightmare situations for years with ssh-based job spawning. The most common case is where sshd processes terminate on nodes without the child mpi processes exiting. Then we have orphaned mpi processes, owned by init, scattered throughout the cluster. If any of these processes are using limited resources (like Myrinet adapters), subsequent jobs can (more likely, will) exit immediately upon dispatch to the node.

We've found ways around this with prolog/epilog scripts, and scheduling policy, but the slickest solutions so far, in my opinion, have been mpiexec (admittedly not part of an MPI implementation) and lam/openmpi. Allowing the resource manager to completely handle job spawning has provided better post-job cleanup, and more complete job statistics (cpu-time, mostly) for us.

Do you not have to deal with these sorts of issues? If not, lay some wisdom on me; I could use it.

Matt

--
Matt Allen        | Systems Analyst
[EMAIL PROTECTED] | Research and Technical Services
812-855-7318      | Indiana University

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
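
For illustration only, here is a rough sketch of the kind of check an epilog script might run to spot the orphaned processes described above. It is not the script used at any particular site. It assumes orphans are reparented to PID 1 (init), that the job owner's username is known to the epilog, and that `ps -eo pid,ppid,user,comm` is available on the node. The function names `find_orphans` and `list_processes` are hypothetical.

```python
import subprocess


def find_orphans(ps_output, job_user):
    """Return (pid, command) pairs for job_user's processes whose parent is PID 1.

    ps_output is text with four whitespace-separated columns per line:
    pid, ppid, user, command. A header line, if present, is skipped.
    """
    orphans = []
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # malformed or empty line
        pid, ppid, user, comm = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # header line such as "PID PPID USER COMMAND"
        # Assumption: a process reparented to init (PPID 1) and owned by the
        # job's user is a leftover from a job whose sshd has already exited.
        if user == job_user and int(ppid) == 1:
            orphans.append((int(pid), comm.strip()))
    return orphans


def list_processes():
    """Return the node's process table as text (assumes a ps that accepts -eo)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

An epilog could call `find_orphans(list_processes(), job_user)` and then log or signal the PIDs it returns. This sketch only reports candidates, because a user may legitimately have daemonized processes on the node, and some ps implementations abbreviate long usernames, so a real script would need site-specific checks before killing anything.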
