Hello Prentice, sorry for that.
My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:

> Hello,
>
> I am trying to use ucx with slurm/pmix and run into the error below. The following works using mpirun, but what I was hoping was the srun equivalent fails. Is there some flag or configuration I might be missing for slurm?
>
> Works fine:
> mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello
>
> does not work:
> srun -n 100 ./hello
> slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
> slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
> slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
> slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***
>
> The configurations for pmix, openmpi, slurm, ucx are the following (on Debian 8):
> pmix 3.1.2
> ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2
>
> openmpi 4.0.1
> ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs
>
> slurm 19.05.0
> ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1
>
> ucx 1.5.1
> ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1
>
> Any advice is much appreciated.
>
> Best,
>
> -Dean

>> Max,
>> You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is.
>> --
>> Prentice
>> On 8/12/20 6:55 AM, Max Quast wrote:
>> I am also trying to use ucx with slurm/PMIx and get the same error. Also mpirun with "--mca pml ucx" works fine.
>>
>> Used versions:
>> Ubuntu 20.04
>> slurm 20.02.4
>> OMPI 4.0.4
>> PMIx 3.1.5
>> UCX 1.9.0-rc1
>> OFED 4.9
>>
>> With ucx 1.8.1 I got a slightly different error:
>> error: host1 [0] pmixp_dconn_ucx.c:245 [pmixp_dconn_ucx_prepare] mpi/pmix: ERROR: Fail to init UCX: Unsupported operation
>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_dconn.c:72 [pmixp_dconn_init] mpi/pmix: ERROR: Cannot get polling fd
>> [2020-08-11T20:24:48.117] [2.0] error: host1 [0] pmixp_server.c:402 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_dconn_init() failed
>> [2020-08-11T20:24:48.117] [2.0] error: (null) [0] mpi_pmix.c:161 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>> [2020-08-11T20:24:48.119] [2.0] error: Failed mpi_hook_slurmstepd_prefork
>> [2020-08-11T20:24:48.121] [2.0] error: job_manager exiting abnormally, rc = -1
>>
>> Did you solve the problem?
>>
>>
>> Greetings,
>> Max
>> --
>> Prentice
>>