Hey Stijn, thank you very much for the advice!
Answers to your questions:

Q: Are you using rdma-core with Mellanox OFED?
A: Only Mellanox OFED, no rdma-core.

Q: And do you have any uverbs_write error messages in dmesg on the hosts?
A: Yes, I have!

I have set 'UCX_TLS=tcp,self,sm' for the slurmd daemons. Is it better to build Slurm without UCX support, or should I simply install rdma-core?

How do I use UCX together with OpenMPI and srun now? It works when I set this manually:

mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000

But if I put srun before mpirun, four tasks are created, two on each node.

Thanks for helping me!

-max

-----Original Message-----
From: Stijn De Weirdt <stijn.dewei...@ugent.be>
Sent: Wednesday, 12 August 2020 22:30
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [External] Re: openmpi / UCX / srun

hi max,

are you using rdma-core with mellanox ofed? and do you have any uverbs_write error messages in dmesg on the hosts?

there is an issue with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround is to start slurmd on the nodes with 'UCX_TLS=tcp,self,sm' in its environment (and not set UCX_TLS in the application environment). that way the ucx used by pmix does not do rdma, which is ok-ish; the app itself will use the default ucx, which will pick rdma instead of tcp.

stijn

On 8/12/20 9:25 PM, Max Quast wrote:
> Hello Prentice,
>
> sorry for that.
>
> My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:
>
>> Hello,
>>
>> I am trying to use ucx with slurm/pmix and run into the error below. The following works using mpirun, but what I hoped was the srun equivalent fails. Is there some flag or configuration I might be missing for slurm?
>>
>> Works fine:
>> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
>>
>> does not work:
>> srun -n 100 ./hello
>> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
>> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
>> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
>> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
>>
>> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
>> pmix 3.1.2
>> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
>>
>> openmpi 4.0.1
>> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
>>
>> slurm 19.05.0
>> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
>>
>> ucx 1.5.1
>> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
>>
>> Any advice is much appreciated.
>>
>> Best,
>>
>> -Dean
>
>>> Max,
>>> You didn't quote the original e-mail, so I'm not sure what the original problem was, or who "you" is.
>>> --
>>> Prentice
>>> On 8/12/20 6:55 AM, Max Quast wrote:
>>> I am also trying to use ucx with slurm/PMIx and get the same error. Also mpirun with "--mca pml ucx" works fine.
>>>
>>> Used versions:
>>> Ubuntu 20.04
>>> slurm 20.02.4
>>> OMPI 4.0.4
>>> PMIx 3.1.5
>>> UCX 1.9.0-rc1
>>> OFED 4.9
>>>
>>> With ucx 1.8.1 I got a slightly different error:
>>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>>
>>> Did you solve the problem?
>>>
>>> Greetings,
>>> Max
>>> --
>>> Prentice
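[Editor's note: the "four tasks" behaviour Max describes is expected when mpirun is wrapped in srun, since srun launches one full mpirun per task. The usual pattern is to launch the program directly through Slurm's PMIx plugin. The sketch below is illustrative only, under the assumption that Slurm was built with PMIx support and using the hostnames, device name, and pingpong program from the mpirun example in the thread; node names, counts, and UCX values would need adapting.]

```shell
# Sketch: launch through Slurm's PMIx plugin instead of 'srun mpirun ...'.
# OMPI_MCA_pml replaces '--mca pml ucx'; the UCX variables mirror the
# '-x' options from the working mpirun command above.
export OMPI_MCA_pml=ucx            # have OpenMPI select the UCX PML
export UCX_TLS=rc                  # RDMA reliable-connected transport
export UCX_NET_DEVICES=mlx5_0:1    # Mellanox HCA and port from the thread
srun --mpi=pmix -N 2 -n 2 -w lsm218,lsm219 ./pingpong 1000 1000
```

Stijn's workaround applies separately to the daemon side: 'UCX_TLS=tcp,self,sm' would be set in slurmd's own startup environment (e.g. via its init script or systemd unit), not exported in the job environment as above.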