The only thing that jumps out in the ctl logs is:
error: step_layout_create: no usable CPUs
The node logs were unremarkable.
It doesn't make sense to me that the same job works with srun, or with an odd number of GPUs under sbatch. I suspect something isn't adding up right somewhere; see the worked numbers below the quoted config.

On Sat, Jan 15, 2022 at 12:56 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
>
> Any error in slurmd.log on the node or slurmctld.log on the ctl?
>
> Sean
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Wayne Hendricks <waynehendri...@gmail.com>
> Sent: Saturday, 15 January 2022 16:04
> To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
> Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
>
> External email: Please exercise caution
>
> Running a test job with srun works:
> wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
> 179851
> Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> 179851
> Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>
> Submitting the same with sbatch does not:
> wayneh@login:~$ sbatch test.sh
> Submitted batch job 179850
> wayneh@login:~$ cat test.out
> srun: error: Unable to create step for job 179850: Unspecified error
> wayneh@login:~$ cat test.sh
> #!/usr/bin/env bash
> #SBATCH -J testing
> #SBATCH -e /home/wayne.hendricks/test.out
> #SBATCH -o /home/wayne.hendricks/test.out
> #SBATCH -G 16
> #SBATCH --partition v100
> srun uname -a
>
> Any idea why srun and sbatch wouldn't run the same way? It seems to run correctly when I use an odd number of GPUs in sbatch. (#SBATCH -G 15)
>
> Node config:
> NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
> PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
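Worked numbers, as promised above. This is a hunch from the posted config only, nothing I've verified: with DefCpuPerGPU=10 and no explicit CPU request, the GPU count alone decides how many CPUs the job grabs on these 8-GPU/80-CPU nodes.

-G 16  ->  16 x 10 = 160 CPUs  ->  80 + 80 across two nodes (both fully allocated)
-G 15  ->  15 x 10 = 150 CPUs  ->  80 + 70 (10 CPUs left free on one node)

If step_layout_create can't place the inner srun's step because the batch step is already holding CPUs on the first node, a wall-to-wall allocation would fail exactly like this, while -G 15 leaves slack and sails through. It would also square with srun -G16 from the login node working, since there's no batch step competing for CPUs in that case.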
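Two things you could try to test that theory; these are stock sbatch/srun options, but whether they change the outcome here is a guess on my part. First confirm what the allocation actually received:

wayneh@login:~$ scontrol show job 179850 | grep -i -e NumCPUs -e TRES

Then in test.sh, either let the step share CPUs with the batch step:

srun --overlap uname -a

or back the per-GPU CPU count off the default (9 is arbitrary, anything under 10 leaves headroom):

#SBATCH -G 16
#SBATCH --cpus-per-gpu=9

If -G 16 runs clean with either change, that points squarely at the fully packed allocation.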