Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Tim Carlson
The SLURM controller AND all the compute nodes need to know which nodes are in the cluster. If you add a node, or a node's IP address changes, you need to let every node know about it, which for me usually means restarting slurmd on the compute nodes. I just say this because I get caught by this …
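For what it's worth, a minimal sketch of that propagation step, assuming passwordless SSH from the controller and compute hostnames node[01-12] (both hypothetical, not from the original message):

    # Push the updated slurm.conf to each compute node and restart slurmd so
    # every node re-reads the node table (hostnames and paths are illustrative):
    for host in node{01..12}; do
        scp /etc/slurm/slurm.conf "${host}:/etc/slurm/slurm.conf"
        ssh "${host}" systemctl restart slurmd
    done

    # Restart (or at least reconfigure) the controller as well:
    systemctl restart slurmctld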

Re: [slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-19 Thread Tim Carlson
…understand how with "shared=exclusive" srun gives one result and sbatch gives another. Tim

On Wed, May 19, 2021 at 11:26 AM Tim Carlson wrote:
> Hey folks,
>
> Here is my setup:
>
> slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1
>
> The relevant parts …
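For anyone trying to reproduce the discrepancy, a quick way to compare the two submission paths (the partition name and GPU count below are assumptions for illustration, not taken from the original post):

    # Ask for one GPU via srun and print what the job step sees:
    srun -p gpu --gres=gpu:1 bash -c 'echo "srun:   $CUDA_VISIBLE_DEVICES"'

    # Same request through sbatch; the output lands in slurm-<jobid>.out:
    cat > gpu_test.sh <<'EOF'
    #!/bin/bash
    #SBATCH -p gpu
    #SBATCH --gres=gpu:1
    echo "sbatch: $CUDA_VISIBLE_DEVICES"
    EOF
    sbatch gpu_test.sh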

[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-19 Thread Tim Carlson
Hey folks,

Here is my setup:

slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1

The relevant parts of the slurm.conf and a particular gres.conf file are:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    PriorityType=priority/multifactor
    GresTypes=gpu
    NodeName=dlt[01-12] Gr…
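The NodeName line is cut off above; for context, a node definition and matching gres.conf for GPU nodes like these usually look something like the following. The GPU count, CPU count, memory, and device paths are illustrative guesses, not values from the original message:

    # slurm.conf (illustrative values only)
    GresTypes=gpu
    NodeName=dlt[01-12] Gres=gpu:8 CPUs=48 RealMemory=386000 State=UNKNOWN

    # gres.conf on each dlt node (device paths are an assumption)
    NodeName=dlt[01-12] Name=gpu File=/dev/nvidia[0-7]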