Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-12 Thread Jodie H. Sprouse
Hello Tina, Thank you for the suggestions and responses!!! As of right now, it seems to be working after taking the "CPUs=" setting out of gres.conf altogether. The original thought process was to have 4 CPUs set aside to always go to the GPU; not so sure that is necessary as long as the CPU partition can
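[A minimal gres.conf sketch of the change described above; the file path, GPU count, and device names are assumptions for illustration, not taken from the original post:

    # /etc/slurm/gres.conf
    # CPUs= omitted, so Slurm no longer pins each GPU to a fixed set of cores
    Name=gpu File=/dev/nvidia[0-3]

Without the CPUs= mapping, jobs requesting a GPU are not constrained to the cores previously tied to that device, which matches the behaviour the poster reports.]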

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-12 Thread Stijn De Weirdt
hi max, are you using rdma-core with mellanox ofed? and do you have any uverbs_write error messages in dmesg on the hosts? there is an issue with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround for the issue is to start slurmd on the nodes with environment 'UCX_TLS=tcp,self,sm'
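[A sketch of the workaround described above, assuming slurmd is managed by systemd; the drop-in file name is an assumption:

    # /etc/systemd/system/slurmd.service.d/ucx.conf
    [Service]
    Environment=UCX_TLS=tcp,self,sm

    systemctl daemon-reload
    systemctl restart slurmd

This restricts the UCX transports used by slurmd's PMIx plugin to TCP and shared memory, sidestepping the RDMA path that triggers the error when rdma-core is not in use.]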

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-12 Thread Prentice Bisbal
Max, You didn't quote the original e-mail so I'm not sure what the original problem was, or who "you" is. -- Prentice On 8/12/20 6:55 AM, Max Quast wrote: I am also trying to use ucx with slurm/PMIx and get the same error.  Also mpirun with "--mca pml ucx" works fine. Used versions: Ubu

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-12 Thread Max Quast
Hello Prentice, sorry for that. My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019: > Hello, > > I am trying to use ucx with slurm/pmix and run into the error below. The following works using mpirun, but what I was hoping would be the srun equivalent fails. Is there some
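[As context for the quoted problem, a quick way to check which PMIx plugins the local Slurm build offers is the standard command:

    srun --mpi=list

This only lists the available MPI/PMIx plugin names; it is shown here as a sanity check, not as the fix discussed in the thread.]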

Re: [slurm-users] openmpi / UCX / srun

2020-08-12 Thread Max Quast
I am also trying to use ucx with slurm/PMIx and get the same error. Also mpirun with "--mca pml ucx" works fine. Used versions: Ubuntu 20.04 slurm 20.02.4 OMPI 4.0.4 PMIx 3.1.5 UCX 1.9.0-rc1 OFED 4.9 With ucx 1.8.1 I got a slightly different error: error: host1 [0] pmixp_dconn_ucx.
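[A sketch of the two launch paths being compared above; the program name and task count are placeholders:

    # works: mpirun selecting the UCX PML directly
    mpirun -np 4 --mca pml ucx ./hello_mpi

    # fails in the reported setup: the intended srun equivalent via PMIx
    srun -n 4 --mpi=pmix ./hello_mpi

The mpirun line bypasses Slurm's PMIx plugin entirely, which is why it succeeds while the srun/PMIx path hits the UCX error described in the thread.]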