hi max,

> I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's.
> Is it better to build slurm without UCX support or should I simply
> install rdma-core?
i would look into using mellanox ofed with rdma-core, as it is what
mellanox is shifting towards or has already done (not sure what 4.9 has
tbh). or leave the env vars, i think for pmix it's ok unless you have
very large clusters (but i'm no expert here).
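(fwiw, one way to set that variable for slurmd only, so it does not leak
into application environments, is a systemd drop-in; the unit name
slurmd.service and the path are assumptions based on a typical
systemd-based install, adjust for your setup:

  # set UCX_TLS only for the slurmd process, not for jobs
  mkdir -p /etc/systemd/system/slurmd.service.d
  printf '[Service]\nEnvironment=UCX_TLS=tcp,self,sm\n' \
    > /etc/systemd/system/slurmd.service.d/ucx.conf
  systemctl daemon-reload && systemctl restart slurmd
)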
> > How do I use ucx together with OpenMPI and srun now?
> It works when I set this manually:
> 'mpirun -np 2 -H lsm218,lsm219 --mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 pingpong 1000 1000'.
> But if I put srun before mpirun four tasks will be created, two on
> each node.
you let pmix do its job and thus simply start the mpi parts with srun
instead of mpirun:

  srun pingpong 1000 1000

if you must tune UCX (as in: default behaviour is not ok), also set it
via env vars. (at least try to use the defaults, it's pretty good i think)

(shameless plug: one of my colleagues set up a tech talk with openmpi
people wrt pmix, ucx, openmpi etc; see
https://github.com/easybuilders/easybuild/issues/630 for details and a
link to the youtube recording)

stijn

> Thanks for helping me!
> -max
>
> -----Original Message-----
> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
> Sent: Wednesday, August 12, 2020 22:30
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users] [External] Re: openmpi / UCX / srun
>
> hi max,
>
> are you using rdma-core with mellanox ofed? and do you have any
> uverbs_write error messages in dmesg on the hosts? there is an issue
> with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround
> for the issue is to start slurmd on the nodes with environment
> 'UCX_TLS=tcp,self,sm' (and not set UCX_TLS in the application
> environment) (so the ucx used by pmix does not do rdma, which is
> ok-ish; the app itself will use the default ucx, which will pick rdma
> instead of tcp)
>
> stijn
>
> On 8/12/20 9:25 PM, Max Quast wrote:
>> Hello Prentice,
>>
>> sorry for that.
>>
>> My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:
>>
>>> Hello,
>>>
>>> I am trying to use ucx with slurm/pmix and run into the error below.
>>> The following works using mpirun, but what I was hoping was the srun
>>> equivalent fails. Is there some flag or configuration I might be
>>> missing for slurm?
>>>
>>> Works fine:
>>> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
>>>
>>> does not work:
>>> srun -n 100 ./hello
>>> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
>>> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
>>> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
>>> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
>>>
>>> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
>>> pmix 3.1.2
>>> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
>>>
>>> openmpi 4.0.1
>>> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
>>>
>>> slurm 19.05.0
>>> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
>>>
>>> ucx 1.5.1
>>> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
>>>
>>> Any advice is much appreciated.
>>>
>>> Best,
>>>
>>> -Dean
>>
>>>> Max,
>>>> You didn't quote the original e-mail so I'm not sure what the
>>>> original problem was, or who "you" is.
>>>> --
>>>> Prentice
>>>> On 8/12/20 6:55 AM, Max Quast wrote:
>>>> I am also trying to use ucx with slurm/PMIx and get the same error.
>>>> Also mpirun with "--mca pml ucx" works fine.
>>>>
>>>> Used versions:
>>>> Ubuntu 20.04
>>>> slurm 20.02.4
>>>> OMPI 4.0.4
>>>> PMIx 3.1.5
>>>> UCX 1.9.0-rc1
>>>> OFED 4.9
>>>>
>>>> With ucx 1.8.1 I got a slightly different error:
>>>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>>>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>>>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>>>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>>>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>>>
>>>> Did you solve the problem?
>>>>
>>>> Greetings,
>>>> Max
>>>> --
>>>> Prentice
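ps. to make the srun suggestion concrete: the equivalent of max's mpirun
line, launched via pmix, would be something like the below. the
--mpi=pmix flag is only needed if pmix is not already the MpiDefault in
your slurm.conf, and the UCX values are copied from max's mpirun example
as an illustration, not a recommendation (prefer leaving them unset):

  # only if the ucx defaults are not ok for the application
  export UCX_TLS=rc
  export UCX_NET_DEVICES=mlx5_0:1
  srun --mpi=pmix -N 2 -n 2 pingpong 1000 1000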