Try scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"
and then scontrol update NodeName=heimdall state=RESUME to see if it works.
Probably just the SLURM daemon having a hiccup after you made changes.

Best,
Feng

On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken <hagelue...@uni-bonn.de> wrote:
>
> Hi,
>
> We have an Ubuntu server (22.04) with currently 5 GPUs (1 x l40 and 4 x rtx_a5000).
> I am trying to configure Slurm so that a user can select either the l40 or the a5000 GPUs for a particular job.
> I have configured my slurm.conf and gres.conf files similarly to this old thread:
> https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
> I have pasted the contents of the two files below.
>
> Unfortunately, my node is always in “drain” state and scontrol shows this error:
> Reason=gres/gpu count reported lower than configured (1 < 5)
>
> Any idea what I am doing wrong?
> Cheers and thanks for your help!
> Gregor
>
> Here are my gres.conf and slurm.conf files.
>
> AutoDetect=off
> NodeName=heimdall Name=gpu Type=l40 File=/dev/nvidia0
> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia1
> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia2
> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia3
> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia4
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmdDebug=debug2
> #
> ClusterName=heimdall
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> GresTypes=gpu
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm/slurmd.log
> #
> # COMPUTE NODES
> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
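If the DOWN/RESUME cycle doesn't clear it, slurmd on the node is probably
genuinely detecting only one GPU. A rough sketch of how to check (assuming
slurmd and slurmctld run as systemd units on the same box, per your
single-node setup):

    # On heimdall: print the GRES configuration that slurmd itself detects
    sudo slurmd -G

    # Confirm all five device files listed in gres.conf actually exist
    ls -l /dev/nvidia[0-4]

    # Restart both daemons so they re-read slurm.conf and gres.conf,
    # then undrain the node
    sudo systemctl restart slurmctld slurmd
    sudo scontrol update NodeName=heimdall state=RESUME

    # The node should now advertise all five GPUs
    scontrol show node heimdall | grep -i gres

Once the node is up, users should be able to pick a GPU type with e.g.
sbatch --gres=gpu:l40:1 or --gres=gpu:a5000:2.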