Running test job with srun works:

    wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
    179851 Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    179851 Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Submitting the same with sbatch does not:

    wayneh@login:~$ sbatch test.sh
    Submitted batch job 179850
    wayneh@login:~$ cat test.out
    srun: error: Unable to create step for job 179850: Unspecified error
    wayneh@login:~$ cat test.sh
    #!/usr/bin/env bash
    #SBATCH -J testing
    #SBATCH -e /home/wayne.hendricks/test.out
    #SBATCH -o /home/wayne.hendricks/test.out
    #SBATCH -G 16
    #SBATCH --partition v100
    srun uname -a

Any idea why srun and sbatch wouldn't run the same way? It seems to run correctly when I use an odd number of GPUs in sbatch. (#SBATCH -G 15)

Node config:

    NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
    PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
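
For reference, here is a rough sketch of what the partition defaults imply for the two GPU counts I tried. It only multiplies the DefCpuPerGPU and DefMemPerGPU values from the config above by the requested GPU count; the `implied` helper and the variable names are made up for illustration, and I am assuming (not claiming) that the totals are relevant to the step-creation error.

```shell
#!/usr/bin/env bash
# Values copied from the NodeName and PartitionName lines above.
cpus_per_node=80         # CPUs
mem_per_node=490000      # RealMemory, in MB
def_cpu_per_gpu=10       # DefCpuPerGPU
def_mem_per_gpu=61250    # DefMemPerGPU, in MB

# Hypothetical helper: print the CPU and memory totals that the
# per-GPU defaults imply for a given GPU count.
implied() {
  local gpus=$1
  echo "gpus=$gpus cpus=$((gpus * def_cpu_per_gpu)) mem=$((gpus * def_mem_per_gpu))"
}

implied 16
implied 15
# Capacity of the two nodes that a 16-GPU job spans.
echo "two nodes: cpus=$((2 * cpus_per_node)) mem=$((2 * mem_per_node))"
```

With these numbers, 16 GPUs works out to exactly the full CPU and memory capacity of two nodes (160 CPUs, 980000 MB), while 15 GPUs leaves some of both unallocated. That may or may not be connected to why the odd count runs.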