On Mon, 16 Oct 2017 13:11:37 -0400 Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 7:16 AM, Peter Kjellström <c...@nsc.liu.se>
> wrote:
> > Another is that your MPIs tried to use rdmacm and that in turn
> > tried to use ibacm which, if incorrectly setup, times out after
> > ~1m. You can verify ibacm functionality by running for example:
> >
> > user@n1 $ ib_acme -d n2
> > ...
> > user@n1 $
> >
> > This should be near instant if ibacm works as it should.
>
> i didn't specifically tell mpi to use one connection setup vs another,
> but i'll see if i can track down what openmpi is doing in that regard.
>
> however, your test above fails on my machines
>
> user@n1# ib_acme -d n3
> service: localhost
> destination: n3
> ib_acm_resolve_ip failed: cannot assign requested address
> return status 0x0

Did this fail instantly or with the typical ~1m timeout?

> in the /etc/rdma/ibacme_addr.cfg file i just lists the data specific
> to each host, which is gathered by ib_acme -A

Often you don't need ibacm running, and if you stop it this specific
problem will go away (i.e. no one can ask ibacm for stuff and hang on
the timeout). The service is typically /etc/init.d/ibacm.

With ibacm stopped, anything that uses librdmacm for lookups will
instead send a direct query to the SA (part of the subnet manager). On
a larger cluster and for certain use cases this can quickly become too
much for the SA (hence the need for caching).

If you have IntelMPI, also try what I suggested and use the ucm dapl
provider. For example, for the first port on an mlx4 hca that's
"ofa-v2-mlx4_0-1u". You can make sure that it comes first in your
dat.conf (/etc/rdma or /etc/infiniband) or pass it explicitly to
IntelMPI:

  I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...

You may want to set I_MPI_DEBUG=4 or so to see what it does.

/Peter K
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
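
On the question of whether the ib_acme lookup fails instantly or only after the ~1m timeout, a small shell sketch along these lines may help tell the two cases apart. The helper names (classify_elapsed, time_lookup) are made up for illustration, and the 5s and 50s thresholds are arbitrary assumptions, not ibacm settings. The only command taken from the thread is "ib_acme -d <host>".

```shell
#!/bin/sh
# Hypothetical helpers, not part of ibacm or any MPI distribution.

# classify_elapsed SECONDS
# Prints a rough label for how long a lookup took.
# The 5s / 50s cut-offs are assumptions chosen for illustration.
classify_elapsed() {
    if [ "$1" -lt 5 ]; then
        echo "instant"
    elif [ "$1" -lt 50 ]; then
        echo "slow"
    else
        echo "timeout-like"
    fi
}

# time_lookup CMD [ARGS...]
# Runs the command with its output discarded, then prints its exit
# status, the elapsed wall-clock time in whole seconds, and the label.
time_lookup() {
    start=$(date +%s)
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    echo "rc=$rc elapsed=${elapsed}s class=$(classify_elapsed "$elapsed")"
}

# Example, run on n1 with n3 as the destination from the thread:
#   time_lookup ib_acme -d n3
```

A "timeout-like" result would be consistent with the ibacm timeout described above, while "instant" would point to the lookup being rejected immediately. Either way, this only measures wall-clock time and says nothing about why the lookup failed.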