Hello Tina,
Thank you for the suggestions and responses!!!
As of right now, it seems to be working after removing "CPUs=" from
gres.conf altogether. The original idea was to have 4 set aside
to always go to the GPU; I'm not so sure that is necessary as long as the CPU
partition can
hi max,
are you using rdma-core with mellanox ofed? and do you have any
uverbs_write error messages in dmesg on the hosts? there is an issue
with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround
for the issue is to start slurmd on the nodes with the environment
variable 'UCX_TLS=tcp,self,sm' set.
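
A minimal sketch of the check and the workaround described above, assuming
slurmd is started by hand from a shell on the node. The helper name
count_uverbs_errors is made up for illustration; if slurmd is started by an
init system, the variable would have to be set in that service's environment
instead.

```shell
#!/bin/sh
# Hypothetical helper: count lines mentioning uverbs_write in kernel log
# text read from stdin. grep -c prints the count; "|| true" keeps the exit
# status at 0 when there are no matches.
count_uverbs_errors() {
    grep -c 'uverbs_write' || true
}

# Restrict UCX to the tcp, self (loopback) and sm (shared memory)
# transports, as suggested in the message above.
export UCX_TLS=tcp,self,sm

# slurmd inherits UCX_TLS when started from this shell, for example in the
# foreground with:
#   slurmd -D
```

On a node this would be used as "dmesg | count_uverbs_errors"; a non-zero
count suggests the uverbs_write messages asked about above are present.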
Max,
You didn't quote the original e-mail so I'm not sure what the original
problem was, or who "you" is.
--
Prentice
On 8/12/20 6:55 AM, Max Quast wrote:
> I am also trying to use ucx with slurm/PMIx and get the same error.
> Also mpirun with "--mca pml ucx" works fine.
> Used versions:
> Ubu
Hello Prentice,
sorry for that.
My post refers to a post by Dean Hidas on Mon Jun 17 17:40:56 UTC 2019:
> Hello,
>
> I am trying to use ucx with slurm/pmix and run into the error below. The
> following works using mpirun, but what I was hoping was the srun equivalent
> fails. Is there some
I am also trying to use ucx with slurm/PMIx and get the same error. Also
mpirun with "--mca pml ucx" works fine.
Used versions:
Ubuntu 20.04
slurm 20.02.4
OMPI 4.0.4
PMIx 3.1.5
UCX 1.9.0-rc1
OFED 4.9
With ucx 1.8.1 I got a slightly different error:
error: host1 [0] pmixp_dconn_ucx.