Also, I have noticed that the behavior only crops up in sbatch when the request is a multiple of whole nodes. A single node runs fine. But on 8-GPU systems, for example, 16/24/32-GPU jobs fail, whereas 15/23/31-GPU jobs run fine. A manual srun command has no issues with any of these configurations.
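For what it's worth, here is a rough sketch of the arithmetic behind that pattern, using the values from the node config quoted below (CPUs=80, Gres=gpu:8, DefCpuPerGPU=10). It assumes the request is packed onto the minimum number of nodes, which is my assumption and not something the logs confirm, and `cpu_budget` is just a made-up helper name. It only shows that the failing GPU counts are the ones where the default CPU-per-GPU request uses every CPU in the allocation; it does not prove that this is what triggers the "no usable CPUs" message.

```shell
#!/usr/bin/env bash
# Values copied from the node/partition config quoted in this thread.
cpus_per_node=80      # CPUs=80
gpus_per_node=8       # Gres=gpu:8
def_cpu_per_gpu=10    # DefCpuPerGPU=10

# Hypothetical helper: for a given -G value, report how many CPUs the
# default CPU-per-GPU setting implies versus what the nodes provide.
# Assumes the job lands on the minimum number of nodes (an assumption).
cpu_budget() {
  local gpus=$1
  local nodes=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))  # round up
  local wanted=$(( gpus * def_cpu_per_gpu ))
  local available=$(( nodes * cpus_per_node ))
  echo "gpus=$gpus nodes=$nodes cpus_wanted=$wanted cpus_available=$available spare=$(( available - wanted ))"
}

for g in 15 16 23 24 31 32; do
  cpu_budget "$g"
done
```

Under those assumptions the 16/24/32 cases come out with spare=0 and the 15/23/31 cases with spare=10, which at least lines up with which jobs fail.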
On Sat, Jan 15, 2022 at 10:32 AM Wayne Hendricks <waynehendri...@gmail.com> wrote:
>
> The only thing that jumps out on the ctl logs is:
> error: step_layout_create: no usable CPUs
> The node logs were unremarkable.
>
> It doesn't make much sense to me that the same job with srun or an odd
> number of GPUs in sbatch works. I suspect something isn't adding up
> right somewhere.
>
> On Sat, Jan 15, 2022 at 12:56 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
> >
> > Any error in slurmd.log on the node or slurmctld.log on the ctl?
> >
> > Sean
> > ________________________________
> > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> > Wayne Hendricks <waynehendri...@gmail.com>
> > Sent: Saturday, 15 January 2022 16:04
> > To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
> > Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
> >
> > External email: Please exercise caution
> >
> > Running test job with srun works:
> > wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
> > 179851
> > Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
> > 2022 x86_64 x86_64 x86_64 GNU/Linux
> > 179851
> > Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
> > 2022 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Submitting the same with sbatch does not:
> > wayneh@login:~$ sbatch test.sh
> > Submitted batch job 179850
> > wayneh@login:~$ cat test.out
> > srun: error: Unable to create step for job 179850: Unspecified error
> > wayneh@login:~$ cat test.sh
> > #!/usr/bin/env bash
> > #SBATCH -J testing
> > #SBATCH -e /home/wayne.hendricks/test.out
> > #SBATCH -o /home/wayne.hendricks/test.out
> > #SBATCH -G 16
> > #SBATCH --partition v100
> > srun uname -a
> >
> > Any idea why srun and sbatch wouldn't run the same way? It seems to
> > run correctly when I use an odd number of GPUs in sbatch.
> > (#SBATCH -G 15)
> >
> > Node config:
> > NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20
> > ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
> > PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8
> > DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
> >
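One way to compare the sbatch and srun paths might be to print what the batch allocation actually looks like before the step is launched. This is only a sketch: `show_alloc` is a hypothetical helper name, and it assumes the usual Slurm output environment variables are exported into the batch shell. Some of them (for example SLURM_CPUS_PER_GPU) may legitimately be unset when the value comes from a partition default rather than from the command line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print allocation-related Slurm variables,
# showing "unset" for any that are not exported in this environment.
show_alloc() {
  for name in SLURM_JOB_ID SLURM_JOB_NODELIST SLURM_JOB_NUM_NODES \
              SLURM_JOB_CPUS_PER_NODE SLURM_GPUS SLURM_CPUS_PER_GPU; do
    value=$(printenv "$name")      # empty if the variable is not exported
    echo "${name}=${value:-unset}"
  done
}

show_alloc
```

Calling this just before `srun uname -a` in test.sh, once with -G 15 and once with -G 16, should show whether the two allocations differ in node or CPU counts.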