...and I'm not sure what "AutoDetect=NVML" is supposed to do in the gres.conf file. We've always used "nvidia-smi topo -m" to confirm whether a node is single-root or dual-root, and then entered the GPU-to-CPU-socket mapping in gres.conf by hand, e.g.:

# 8-gpu A6000 nodes - dual-root
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7] CPUs=24-47
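For what it's worth, AutoDetect=NVML is meant to replace exactly those hand-written lines: when slurmd is built against the NVIDIA Management Library it queries the GPUs itself and fills in the device files, types, core affinities, and NVLink topology, so gres.conf can shrink to the single AutoDetect line (slurm.conf still needs GresTypes=gpu and a Gres= count on each node definition). If you stay with the explicit style, a minimal sketch for the RTX6000 nodes discussed below might look like this; the CPU ranges are purely an assumption about a dual-root board and would need to be confirmed with "nvidia-smi topo -m" and lscpu first:

# hypothetical explicit gres.conf for node[02-04] - CPU ranges assumed, not verified
NodeName=node[02-04] Name=gpu File=/dev/nvidia[0-1] CPUs=0-31
NodeName=node[02-04] Name=gpu File=/dev/nvidia[2-3] CPUs=32-63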
On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:

> Well... you've got lots of weirdness, as the scontrol show job command isn't listing any GPU TRES requests, and the scontrol show node command isn't listing any configured GPU TRES resources.
>
> If you send me your entire slurm.conf I'll have a quick look-over.
>
> You also should be using cgroup.conf to fence off the GPU devices so that a job only sees the GPUs it has been allocated. The lines in the batch file that try to figure this out aren't necessary. I forgot to ask you about cgroup.conf.
>
> regards,
> Sam
>
> On Fri, Aug 20, 2021 at 5:46 PM Andrey Malyutin <malyuti...@gmail.com> wrote:
>
>> Thank you Samuel,
>>
>> Slurm version is 20.02.6. I'm not entirely sure about the platform; the RTX6000 nodes are about two years old, and the 3090 node is very recent. Technically we have 4 nodes (hence the references to node04 in the info below), but one of the nodes is down and out of the system at the moment. As you can see, the job really wants to run on the downed node instead of going to node02 or node03.
>>
>> Thank you again,
>> Andrey
>>
>> *scontrol show job:*
>>
>> JobId=283 JobName=cryosparc_P2_J214
>> UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
>> Priority=4294901572 Nice=0 Account=(null) QOS=normal
>> JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:node04 Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2021-08-20T20:55:00 EligibleTime=2021-08-20T20:55:00
>> AccrueTime=2021-08-20T20:55:00
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-20T23:36:14
>> Partition=CSCluster AllocNode:Sid=headnode:108964
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=4,mem=24000M,node=1,billing=4
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>> Command=/data/backups/takeda2/data/cryosparc_projects/P8/J214/queue_sub_script.sh
>> WorkDir=/ssd/CryoSparc/cryosparc_master
>> StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>> StdIn=/dev/null
>> StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>> Power=
>> TresPerNode=gpu:1
>> MailUser=cryosparc MailType=NONE
>>
>> *Script:*
>>
>> #SBATCH --job-name cryosparc_P2_J214
>> #SBATCH -n 4
>> #SBATCH --gres=gpu:1
>> #SBATCH -p CSCluster
>> #SBATCH --mem=24000MB
>> #SBATCH --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>> #SBATCH --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>> available_devs=""
>> for devidx in $(seq 0 15);
>> do
>>     if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
>>         if [[ -z "$available_devs" ]] ; then
>>             available_devs=$devidx
>>         else
>>             available_devs=$available_devs,$devidx
>>         fi
>>     fi
>> done
>> export CUDA_VISIBLE_DEVICES=$available_devs
>>
>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J214 --master_hostname headnode.cm.cluster --master_command_core_port 39002 > /data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log 2>&1
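On Sam's cgroup.conf point above: with the task/cgroup plugin and ConstrainDevices=yes, a job can only see the GPU device files it was allocated, and Slurm sets CUDA_VISIBLE_DEVICES for that allocation on its own, which is what makes the nvidia-smi probing loop in this script unnecessary. A minimal sketch of cgroup.conf, assuming slurm.conf also sets TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup (neither is visible in the excerpt below):

# cgroup.conf - minimal sketch, not taken from this cluster
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# hide unallocated /dev/nvidia* devices from each job
ConstrainDevices=yes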
>> *Slurm.conf*
>>
>> # This section of this file was automatically generated by cmd. Do not edit manually!
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>> # Server nodes
>> SlurmctldHost=headnode
>> AccountingStorageHost=master
>> #############################################################################################
>> #GPU Nodes
>> #############################################################################################
>> NodeName=node[02-04] Procs=64 CoresPerSocket=16 RealMemory=257024 Sockets=2 ThreadsPerCore=2 Feature=RTX6000 Gres=gpu:4
>> NodeName=node01 Procs=64 CoresPerSocket=16 RealMemory=386048 Sockets=2 ThreadsPerCore=2 Feature=RTX3090 Gres=gpu:4
>> #NodeName=node[05-08] Procs=8 Gres=gpu:4
>> #
>> #############################################################################################
>> # Partitions
>> #############################################################################################
>> PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[01-04]
>> PartitionName=CSLive MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node01
>> PartitionName=CSCluster MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[02-04]
>> ClusterName=slurm
>>
>> *Gres.conf*
>>
>> # This section of this file was automatically generated by cmd. Do not edit manually!
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>> AutoDetect=NVML
>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>> #Name=gpu File=/dev/nvidia[0-3] Count=4
>> #Name=mic Count=0
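A quick way to cross-check what AutoDetect=NVML actually discovered against what the controller has registered (the missing GPU TRES Sam points out above) is sketched here; the node name is taken from this thread, and the -G option is an assumption in that it is only present in slurmd builds that support printing the detected GRES, so fall back to the slurmd log if it isn't available:

# on a compute node: ask slurmd to print the GRES configuration it detects
slurmd -G
# on the head node: compare with what the controller has registered
scontrol show node node02 | grep -i -e gres -e tres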
>>
>> *Sinfo:*
>>
>> PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>> defq*      up     infinite       1  down*  node04
>> defq*      up     infinite       3  idle   node[01-03]
>> CSLive     up     infinite       1  idle   node01
>> CSCluster  up     infinite       1  down*  node04
>> CSCluster  up     infinite       2  idle   node[02-03]
>>
>> *Node1:*
>>
>> NodeName=node01 Arch=x86_64 CoresPerSocket=16
>> CPUAlloc=0 CPUTot=64 CPULoad=0.04
>> AvailableFeatures=RTX3090
>> ActiveFeatures=RTX3090
>> Gres=gpu:4
>> NodeAddr=node01 NodeHostName=node01 Version=20.02.6
>> OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>> RealMemory=386048 AllocMem=0 FreeMem=16665 Sockets=2 Boards=1
>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=defq,CSLive
>> BootTime=2021-08-04T13:59:08 SlurmdStartTime=2021-08-10T09:32:43
>> CfgTRES=cpu=64,mem=377G,billing=64
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>> *Node2-3:*
>>
>> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>> CPUAlloc=0 CPUTot=64 CPULoad=0.48
>> AvailableFeatures=RTX6000
>> ActiveFeatures=RTX6000
>> Gres=gpu:4(S:0-1)
>> NodeAddr=node02 NodeHostName=node02 Version=20.02.6
>> OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>> RealMemory=257024 AllocMem=0 FreeMem=2259 Sockets=2 Boards=1
>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=defq,CSCluster
>> BootTime=2021-07-29T20:47:32 SlurmdStartTime=2021-08-10T09:32:55
>> CfgTRES=cpu=64,mem=251G,billing=64
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>> On Thu, Aug 19, 2021, 6:07 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:
>>
>>> What SLURM version are you running?
>>>
>>> What are the #SBATCH directives in the batch script? (or the sbatch arguments)
>>>
>>> When the single-GPU jobs are pending, what's the output of 'scontrol show job JOBID'?
>>>
>>> What are the node definitions in slurm.conf, and the lines in gres.conf?
>>>
>>> Are the nodes all the same host platform (motherboard)?
>>>
>>> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX-1s, A6000s, and A40s, with a mix of single- and dual-root platforms, and haven't seen this problem with SLURM 20.02.6 or earlier versions.
>>>
>>> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyuti...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are in the process of finishing up the setup of a cluster with 3 nodes, 4 GPUs each. One node has RTX3090s and the other two have RTX6000s. Any job asking for 1 GPU in the submission script will wait to run on the 3090 node, no matter the resource availability. The same job requesting 2 or more GPUs will run on any node. I don't even know where to begin troubleshooting this issue; the entries for the 3 nodes are effectively identical in slurm.conf. Any help would be appreciated. (If helpful: this cluster is used for structural biology, with the cryosparc and relion packages.)
>>>>
>>>> Thank you,
>>>> Andrey