Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all

M J Harvey Wed, 04 Oct 2006 21:56:43 -0700

> Hello,
>
> We are going through a similar experience at one of our customer sites.
> They are trying to run Intel MPI on more than 1,000 nodes.  Are you
> experiencing problems starting the MPD ring?  We noticed it takes a
> really long time especially when the node count is large.  It also just
> doesn't work sometimes.

I've had similar problems with slow and unreliable startup of the Intel mpd ring. I noticed that before spawning the individual mpds, it connects to each node and checks the version of the installed python (function getversionpython() in mpdboot.py). On my cluster, at least, this check was very slow (not to say pointless). Removing it dramatically improved startup time - now it's merely slow.
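To see how much such a per-node check costs on a given cluster, something like the rough sketch below can time one remote round trip per host, one after another. It is not code from mpdboot.py: the function name time_remote_check, the use of ssh, and the remote command "python -V" are all assumptions made for illustration.

```python
import subprocess
import time


def time_remote_check(hosts, runner=None, rsh="ssh"):
    """Time one remote command per host, serially; return {host: seconds}.

    Hypothetical helper, not part of Intel MPI or mpd. By default it runs
    `<rsh> <host> python -V`, which is only an assumed stand-in for
    whatever the real version check does.
    """
    if runner is None:
        def runner(host):
            try:
                subprocess.run(
                    [rsh, host, "python", "-V"],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    timeout=30,
                    check=False,
                )
            except subprocess.TimeoutExpired:
                # A hung node still gets a timing entry (about 30 s).
                pass

    timings = {}
    for host in hosts:
        start = time.monotonic()
        runner(host)
        timings[host] = time.monotonic() - start
    return timings
```

Summing the returned values gives an estimate of what a serial check of this kind adds to startup for that host list.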

Also, for jobs with large process counts, it's worth increasing recvTimeout in mpirun from 20 seconds. This value governs the amount of time mpirun waits for the secondary mpi processes to be spawned by the remote mpds and the default value is much too aggressive for large jobs started via ssh.
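As a purely illustrative way of picking a larger value, the sketch below scales the timeout with the process count. Only the 20-second default comes from mpirun; the function name, the per-process allowance and the cap are assumptions and would need tuning for a particular cluster.

```python
def scaled_recv_timeout(nprocs, base=20.0, per_proc=0.05, cap=600.0):
    """Suggest a receive timeout in seconds for a job of nprocs processes.

    Hypothetical heuristic, not part of mpirun: start from the 20 s
    default, add an assumed per-process allowance, and cap the result
    so a failed startup is still reported eventually.
    """
    if nprocs < 0:
        raise ValueError("nprocs must be non-negative")
    return min(cap, base + per_proc * nprocs)
```

The resulting number would still have to be written into recvTimeout by hand.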


Kind Regards,

Matt

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
