Dear Feng,

That worked! Thank you!

Cheers
Gregor

Sent from my iPhone.
> On 16.10.2023 at 17:05, Feng Zhang <prod.f...@gmail.com> wrote:
>
> Try
>
> scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"
>
> and then
>
> scontrol update NodeName=heimdall state=RESUME
>
> to see if it will work. Probably just the SLURM daemon having a hiccup
> after you made changes.
>
> Best,
>
> Feng
>
>> On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken
>> <hagelue...@uni-bonn.de> wrote:
>>
>> Hi,
>>
>> We have an Ubuntu server (22.04) with currently 5 GPUs (1 x L40 and
>> 4 x RTX A5000).
>> I am trying to configure Slurm such that a user can select either the
>> L40 or the A5000 GPUs for a particular job.
>> I have configured my slurm.conf and gres.conf files similarly to this
>> old thread:
>> https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
>> I have pasted the contents of the two files below.
>>
>> Unfortunately, my node is always in the “drain” state, and scontrol
>> shows this error:
>> Reason=gres/gpu count reported lower than configured (1 < 5)
>>
>> Any idea what I am doing wrong?
>> Cheers and thanks for your help!
>> Gregor
>>
>> Here are my gres.conf and slurm.conf files.
>>
>> # gres.conf
>> AutoDetect=off
>> NodeName=heimdall Name=gpu Type=l40 File=/dev/nvidia0
>> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia1
>> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia2
>> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia3
>> NodeName=heimdall Name=gpu Type=a5000 File=/dev/nvidia4
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> SlurmdDebug=debug2
>> #
>> ClusterName=heimdall
>> SlurmctldHost=localhost
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/var/lib/slurm/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>> GresTypes=gpu
>> #
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/none
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> #
>> # COMPUTE NODES
>> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 State=UNKNOWN
>> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP DefMemPerCPU=8000 DefCpuPerGPU=16
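For anyone hitting the same drain reason: "gres/gpu count reported lower than configured (1 < 5)" typically means slurmd is still reporting an older GRES count, e.g. because it has not re-read gres.conf since the last change, as Feng's "daemon hiccup" diagnosis suggests. A quick way to compare what the driver and slurmd each see, assuming the stock Ubuntu packages with nvidia-smi installed and the daemons managed by systemd:

# GPUs visible to the NVIDIA driver
nvidia-smi -L

# GRES devices slurmd itself detects on this node
slurmd -G

# after editing gres.conf or slurm.conf, restart the daemons so they re-read both files
sudo systemctl restart slurmd slurmctld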
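After the state=DOWN / state=RESUME cycle Feng suggests, the node's state and GRES counts can be confirmed with standard scontrol and sinfo queries:

# full node record, including Gres= and any remaining drain Reason=
scontrol show node heimdall

# one-line overview per node: name, state, generic resources
sinfo -N -o "%N %t %G"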
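With the Type=l40 and Type=a5000 entries in place, users can then pick a GPU model per job via the usual GRES syntax. A minimal sketch, where job.sh is only a placeholder for a real batch script:

# run on the single L40
sbatch --gres=gpu:l40:1 job.sh

# run on two of the A5000s
sbatch --gres=gpu:a5000:2 job.sh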