Compile Slurm without UCX support. We wound up spending quality time with the Mellanox... wait, no, NVIDIA Networking UCX folks to get this sorted out. I also recommend using Slurm 20 rather than 19.

Regards,
s
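For reference, a minimal sketch of such a rebuild, assuming the same rpmbuild flow shown in Daniel's setup notes further down this thread ("slurm-20.xx.x" is a placeholder for whatever Slurm 20 tarball you actually deploy):

    # Daniel's original build: rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx --define '_with_pmix --with-pmix=/usr'
    # Same flow with the UCX option dropped, so the mpi/pmix plugin is built without UCX support:
    rpmbuild -ta slurm-20.xx.x.tar.bz2 --without debug --define '_with_pmix --with-pmix=/usr'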
On Thu, Oct 22, 2020 at 10:23 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> Was there ever a result to this? I'm seeing the same error message,
> but I'm not adding all the environment flags the original poster used.
>
> On Wed, Jul 10, 2019 at 9:18 AM Daniel Letai <d...@letai.org.il> wrote:
> >
> > Thank you Artem,
> >
> > I made a mistake while typing the mail; in all cases it was 'OMPI_MCA_pml=ucx' and not as written. When I went over the mail before sending, I must have erroneously 'fixed' it for some reason.
> >
> > ----
> > Best regards,
> > --Dani_L.
> >
> > On 7/9/19 9:06 PM, Artem Polyakov wrote:
> >
> > Hello, Daniel
> >
> > Let me try to reproduce locally and get back to you.
> >
> > ----
> > Best regards,
> > Artem Y. Polyakov, PhD
> > Senior Architect, SW
> > Mellanox Technologies
> > ________________________________
> > From: p...@googlegroups.com <p...@googlegroups.com> on behalf of Daniel Letai <d...@letai.org.il>
> > Sent: Tuesday, July 9, 2019 3:25:22 AM
> > To: Slurm User Community List; p...@googlegroups.com; ucx-gr...@elist.ornl.gov
> > Subject: [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
> >
> > Cross-posting to the Slurm, PMIx and UCX lists.
> >
> > Trying to execute a simple Open MPI (4.0.1) mpi-hello-world via Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0) results in:
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
> >
> > slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
> > slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
> > slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
> > slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT 2019-07-01T13:20:36 ***
> > slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 0
> > srun: error: n2: task 1: Killed
> > srun: error: n1: task 0: Killed
> >
> > However, the following works:
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> > [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
> >
> > n2: Process 1 out of 2
> > n1: Process 0 out of 2
> >
> > Executing mpirun directly (same environment variables, without the Slurm ones) works, so UCX appears to function correctly.
> >
> > If both SLURM_PMIX_DIRECT_CONN_EARLY=true and SLURM_PMIX_DIRECT_CONN_UCX=true are set, I get collective timeout errors from mellanox/hcoll and "glibc detected /data/mpihello/mpihello: malloc(): memory corruption (fast)".
> >
> > Can anyone help with getting PMIx direct connections over UCX to work in Slurm?
> >
> > Some info about my setup:
> >
> > UCX version:
> >
> > [root@n1 ~]# ucx_info -v
> > # UCT version=1.5.0 revision 02078b9
> > # configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check
> >
> > Mellanox OFED version:
> >
> > [root@n1 ~]# ofed_info -s
> > OFED-internal-4.5-1.0.1:
> >
> > Slurm:
> >
> > Slurm was built with:
> > rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx --define '_with_pmix --with-pmix=/usr'
> >
> > PMIx:
> >
> > [root@n1 ~]# pmix_info -c --parsable
> > config:user:root
> > config:timestamp:"Mon Mar 25 09:51:04 IST 2019"
> > config:host:slurm-test
> > config:cli: '--host=x86_64-redhat-linux-gnu' '--build=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info'
> >
> > Thanks,
> >
> > --Dani_L.
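As a stopgap on an existing Slurm 19 install, the working invocations quoted above boil down to keeping PMIx direct connections but steering them away from UCX. A minimal form, reusing Daniel's job parameters and relying on srun's default environment propagation instead of the explicit --export list (an assumption on my part, not something tested in the thread):

    # keep the PMIx direct-connection path, but disable its UCX transport
    SLURM_PMIX_DIRECT_CONN=true SLURM_PMIX_DIRECT_CONN_UCX=false \
        srun --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello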
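When chasing ucp_ep_create failures like the one above, it can also help to confirm that the transports and device the job requests (rc/shm on mlx4_0:1 here) are actually visible to UCX on every node. A rough check, assuming the same ucx_info binary already used for 'ucx_info -v' above:

    # list the transports and devices UCX detects on this node
    ucx_info -d | grep -i -E 'transport|device'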