Also, I have noticed that the behavior only crops up in sbatch when
whole multiples of nodes are requested. A single node runs fine. But
on 8-GPU systems, for example, 16/24/32-GPU jobs fail, whereas
15/23/31-GPU jobs run fine. A manual srun command has no issues
requesting any of these configurations.
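For anyone who wants to reproduce the pattern, here is a rough sketch that builds the test matrix described above. It only generates command strings and does not submit anything. Assumptions: I am using the sbatch `--gpus` option as the request form purely for illustration (this message does not say how the GPUs were actually requested), and `job.sh` is a hypothetical batch script name.

```python
GPUS_PER_NODE = 8  # from the report: 8-GPU systems


def is_whole_node_multiple(gpus, per_node=GPUS_PER_NODE):
    """True when the request exactly fills two or more nodes."""
    return gpus > per_node and gpus % per_node == 0


def repro_matrix(counts, script="job.sh"):
    """Return (gpus, fills_whole_nodes, command) rows for each GPU count.

    `script` is a hypothetical batch script name; `--gpus` is an assumed
    request form, not necessarily the one used in the failing jobs.
    """
    rows = []
    for n in counts:
        cmd = f"sbatch --gpus={n} {script}"
        rows.append((n, is_whole_node_multiple(n), cmd))
    return rows


if __name__ == "__main__":
    for gpus, whole, cmd in repro_matrix([8, 15, 16, 23, 24, 31, 32]):
        label = "multi-node, whole nodes" if whole else "single or partial node"
        print(f"{gpus:>3}  {label:<24} {cmd}")
```

Per the observations above, the rows flagged as whole-node multiples (16/24/32) are the ones expected to fail under sbatch, while the rest are expected to run.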
The only thing that jumps out in the ctl logs is:
error: step_layout_create: no usable CPUs
The node logs were unremarkable.
It doesn't make much sense to me that the same job works with srun,
or with sbatch when the GPU count is not a whole-node multiple. I
suspect something isn't adding up right somewhere.