Hello Experts,

I'm a new Slurm user (so please bear with me :) ...).
We've recently deployed Slurm 23.11 on a very simple cluster, which consists of a 
master node (also acting as the login and slurmdbd node), a compute node with an 
NVIDIA HGX A100-SXM4-40GB board, detected as 4 GPUs (GPU [0-3]), and a storage 
array sharing the NFS disk on which users' home directories are created.

The problem is that I've never been able to run even a simple/dummy batch script 
in parallel across the 4 GPUs. In fact, running the same command "sbatch 
gpu-job.sh" multiple times shows that only a single job is running, while the 
others stay pending:

[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 214
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 215
[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 216
[slurmtest@c-a100-master test-batch-scripts]$ squeue
             JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
               216       gpu  gpu-job slurmtest PD       0:00      1 (None)
               215       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
               214       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
               213       gpu  gpu-job slurmtest PD       0:00      1 (Priority)
               212       gpu  gpu-job slurmtest PD       0:00      1 (Resources)
               211       gpu  gpu-job slurmtest  R       0:14      1 c-a100-cn01

PS: CPU jobs (i.e. jobs submitted to the default debug partition, without 
requesting the GPU GRES) do run in parallel; a rough sketch of the CPU-only 
script I compare against is shown below. The issue with running jobs in parallel 
only appears when the GPUs are requested as a GRES.
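
For reference, the CPU-only script is essentially the GPU script with the 
GPU-related directives removed; the sketch below is an approximation (job name 
and output file names are just placeholders), not a verbatim copy of the file:

#!/bin/bash
#SBATCH --job-name=cpu-job            # placeholder name
#SBATCH --partition=debug             # default debug partition, no GRES requested
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --output=cpu_job_output.%j
#SBATCH --error=cpu_job_error.%j

hostname
date
sleep 40
pwd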

I've tried many combinations of settings in gres.conf and slurm.conf; many (if 
not most) of them resulted in error messages in the slurmctld and slurmd logs.

The current gres.conf and slurm.conf contents are shown below. This 
configuration doesn't produce errors when restarting the slurmctld and slurmd 
services (on the master and compute nodes, respectively), but, as mentioned, it 
doesn't allow GPU jobs to run in parallel. The batch script is included as well, 
to make clearer what I'm trying to do:

[root@c-a100-master slurm]# cat gres.conf | grep -v "^#"
NodeName=c-a100-cn01 AutoDetect=nvml Name=gpu Type=A100 File=/dev/nvidia[0-3]

[root@c-a100-master slurm]# cat slurm.conf | grep -v "^#" | egrep -i "AccountingStorageTRES|GresTypes|NodeName|partition"
GresTypes=gpu
AccountingStorageTRES=gres/gpu
NodeName=c-a100-cn01 Gres=gpu:A100:4 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515181 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=ALL MaxTime=10:0:0

[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j   # Output file name (%j is replaced with the job ID)
#SBATCH --error=gpu_job_error.%j     # Error file name (%j is replaced with the job ID)

hostname
date
sleep 40
pwd
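
What I was expecting is that, when each job asks for only part of the node's 
GPUs, several jobs could run on the node side by side. Just for illustration 
(this is a sketch of what I had in mind, not something I've verified is the 
correct way to express it), a variant requesting a single GPU per job is the 
kind of thing I imagined four copies of could run concurrently:

#!/bin/bash
#SBATCH --job-name=gpu-job-1gpu      # placeholder name
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                 # one GPU per job instead of all 4
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j

hostname
date
sleep 40
pwd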


Any help on which changes need to be made to the config files (mainly 
slurm.conf and gres.conf) and/or the batch script, so that multiple jobs can be 
in the "Running" state at the same time (in parallel), would be much appreciated.

Thanks in advance for your help!


Best regards,

Hafedh Kherfani
