date:20220115

slurm-users@lists.schedmd.com

2022-01-15 Thread Wayne Hendricks

Also, I have noticed that the behavior only crops up in sbatch when multiples of whole nodes are requested. A single node runs fine. But on 8-GPU systems, say, 16/24/32-GPU jobs fail, whereas 15/23/31-GPU jobs run fine. A manual srun command does not have any issues requesting any of these configur

slurm-users@lists.schedmd.com

2022-01-15 Thread Wayne Hendricks

The only thing that jumps out in the ctl logs is: "error: step_layout_create: no usable CPUs". The node logs were unremarkable. It doesn't make much sense to me that the same job works with srun, or with an odd number of GPUs in sbatch. I suspect something isn't adding up right somewhere. On Sat, Jan 15,