On Wed, 11 Oct 2017 10:12:02 -0400
Michael Di Domenico <mdidomeni...@gmail.com> wrote:

> i'm seeing issues on a mellanox fdr10 cluster where the mpi setup and
> teardown takes longer than i expect it should on larger rank-count
> jobs.  i'm only trying to run ~1000 ranks and the startup time is over
> a minute.  i tested this with both openmpi and intel mpi; both exhibit
> close to the same behavior.

First, that performance is neither expected nor good. Startup should be
sub-second for ~1000 ranks or so, though YMMV...

One possibility is that some slow and/or flaky TCP/IP Ethernet path got
involved somehow.
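
With Open MPI, one quick way to probe this (a sketch; adjust the launch
line to your site) is to exclude the TCP byte-transfer layer so the job
fails fast instead of quietly falling back to Ethernet for MPI traffic:

user@n1 $ mpirun --mca btl ^tcp -np 1000 ./app

Note that Open MPI's launch wire-up still goes over its out-of-band TCP
channel regardless of the BTL selection, so a flaky management network
can still slow startup even with this exclusion in place.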

Another is that your MPIs tried to use rdmacm, which in turn tried to
use ibacm, which, if incorrectly set up, times out after ~1m. You can
verify ibacm functionality by running, for example:

user@n1 $ ib_acme -d n2
...
user@n1 $

This should be near instant if ibacm works as it should.
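
As a rough check you can also time the same lookup; with a working
ibacm it returns in well under a second, while a broken setup hangs
until the timeout:

user@n1 $ time ib_acme -d n2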

If you use Intel MPI (and therefore, by default, DAPL), edit your
dat.conf or manually select the ucm DAPL provider. That one is fast and
does not use rdmacm.
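
For instance, the provider can be picked per job via the
I_MPI_DAPL_PROVIDER environment variable. The provider name below is an
assumption; look for the ucm entries in your own dat.conf, typically
the ones ending in -1u and backed by libdaplucm:

user@n1 $ export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
user@n1 $ mpirun -n 1000 ./app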

Good luck,
 Peter K

> has anyone else seen this or might know how to fix it?  i expect ~1000
> ranks to take some time to set up, but it seems to be taking longer
> than i think it should

