Here is my SLURM script:

#!/bin/bash
#SBATCH --job-name="gpu_test"
#SBATCH --output=gpu_test_%j.log    # Standard output and error log
#SBATCH --account=berceanu_a+
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=31200m        # Reserve ~31 GB of RAM per core
#SBATCH --time=12:00:00             # Max allowed job runtime
#SBATCH --gres=gpu:16               # Reserve all 16 GPUs on the node

export SLURM_EXACT=1

srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
wait

What I expect this to do is run, in parallel, 4 independent copies of the gpu_test.py Python script (a minimal stand-in is sketched in the P.S. below), using 4 of the 16 GPUs on this node. What it actually does is run the script on a single GPU only; it's as if the other 3 srun commands do nothing. Perhaps they do not see any available GPUs for some reason?

System info:

slurm 19.05.2
Linux 5.4.0-90-generic #101~18.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up   infinite      1   idle thor

NodeName=thor Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.45
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:16(S:0-1)
   NodeAddr=thor NodeHostName=thor
   OS=Linux 5.4.0-90-generic #101~18.04.1-Ubuntu SMP Fri Oct 22 09:25:04 UTC 2021
   RealMemory=1546812 AllocMem=0 FreeMem=1433049 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2023-08-09T14:58:01 SlurmdStartTime=2023-08-09T14:58:36
   CfgTRES=cpu=48,mem=1546812M,billing=48,gres/gpu=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I can add any additional system info as required.

Thank you so much for taking the time to read this.

Regards,
Andrei
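
P.S. In case it helps, here is a minimal stand-in for gpu_test.py (an illustration only, not my actual script): it just reports which GPUs each job step can see, which is exactly what I am trying to verify across the four srun steps.

import os
import socket
import subprocess

def main():
    # Slurm sets CUDA_VISIBLE_DEVICES per step when GPUs are bound via gres,
    # so this shows which devices the step is allowed to use.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>")
    step_id = os.environ.get("SLURM_STEP_ID", "<no step id>")
    print("host=%s step=%s CUDA_VISIBLE_DEVICES=%s"
          % (socket.gethostname(), step_id, visible))

    # nvidia-smi -L lists the physical GPUs actually reachable from this step.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    print(result.stdout)

if __name__ == "__main__":
    main()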