Hi Thomas,

Indeed, even on my cluster, the CPU ID reported by scontrol does not
match the physical CPU assigned to the job:

# scontrol show job 24115206_399 -d
JobId=24115684 ArrayJobId=24115206 ArrayTaskId=399 JobName=s10
   JOB_GRES=(null)
     Nodes=spartan-bm096 CPU_IDs=50 Mem=4000 GRES=

[root@spartan-bm096 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_11470/job_24115684/cpuset.cpus
58

I will keep searching. I know we capture the real CPU ID as well, using
daemons running on the worker nodes, and we feed that into Ganglia.
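The core of that idea only takes a few lines. A rough, untested sketch
(it assumes the cgroup v1 cpuset hierarchy shown in the paths above;
names and paths will differ per site):

#!/bin/bash
# Print each Slurm job's real cpuset as the kernel sees it on this node
# (cgroup v1, cpuset controller). The slurm* glob covers both the plain
# "slurm" and the "slurm_<nodename>" naming seen in this thread.
base=/sys/fs/cgroup/cpuset
for f in "$base"/slurm*/uid_*/job_*/cpuset.cpus; do
    [ -e "$f" ] || continue                  # no running jobs on this node
    job=${f%/cpuset.cpus}                    # strip the trailing filename ...
    job=${job##*/job_}                       # ... and keep only the job ID
    printf 'node=%s job=%s cpus=%s\n' "$(hostname -s)" "$job" "$(cat "$f")"
done

Each output line can then be shipped to the monitoring side, e.g. with
Ganglia's gmetric, or simply collected over ssh.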
Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Fri, 12 Feb 2021 at 06:15, Thomas Zeiser <thomas.zei...@rrze.uni-erlangen.de> wrote:

> UoM notice: External email. Be cautious of links, attachments, or
> impersonation attempts
>
> Hi Sean,
>
> unfortunately, the CPU_IDs and GPU IDX given by "scontrol -d show
> job JOBID" are not related in any way to the ordering of the
> hardware. They seem to be just the sequence in which the cores /
> GPUs were assigned by Slurm.
>
> For reference, the PCI Bus-IDs of the GPUs when queried as root
> outside of any cgroup:
>
> | GPU  Name              Persistence-M | Bus-Id             Disp.A |
> |   0  A100-SXM4-40GB    On            | 00000000:01:00.0      Off |
> |   1  A100-SXM4-40GB    On            | 00000000:41:00.0      Off |
> |   2  A100-SXM4-40GB    On            | 00000000:81:00.0      Off |
> |   3  A100-SXM4-40GB    On            | 00000000:C1:00.0      Off |
>
> I submitted two jobs, requesting 1 GPU and 3 GPUs respectively, to a
> node with 4 GPUs. Both run concurrently.
>
> Output of the first (1 GPU) job:
>
> |   0  A100-SXM4-40GB    On            | 00000000:41:00.0      Off |          0 |
> SLURM_JOB_GPUS=0
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
> Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=64-95,192-223
>
> I understand CUDA_VISIBLE_DEVICES=0, as that is within the cgroup.
> However, 00000000:41:00.0 is by no means IDX 0; it is only the first
> GPU assigned on the node by Slurm. The CPU_IDs do not match the
> cpuset in any way. (The CPUs are 2x 64 cores with SMT enabled.)
>
> Output of the second (3 GPU) job, running concurrently:
>
> |   0  A100-SXM4-40GB    On            | 00000000:01:00.0      Off |          0 |
> |   1  A100-SXM4-40GB    On            | 00000000:81:00.0      Off |          0 |
> |   2  A100-SXM4-40GB    On            | 00000000:C1:00.0      Off |          0 |
> SLURM_JOB_GPUS=1,2,3
> GPU_DEVICE_ORDINAL=0,1,2
> CUDA_VISIBLE_DEVICES=0,1,2
> Nodes=tg091 CPU_IDs=64-255 Mem=360000 GRES=gpu:a100:3(IDX:1-3)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-63,96-191,224-255
>
> Again, CUDA_VISIBLE_DEVICES=0,1,2 is reasonable within the cgroup.
> However, IDX:1-3 or SLURM_JOB_GPUS=1,2,3 does not correspond to the
> Bus-IDs, which would be 0, 2, 3 according to the non-cgroup output.
> Again, there is no relation between the CPU_IDs and the cpuset.
>
> If the jobs are started in reverse order:
>
> Output of the 3 GPU job started as the first job on the node:
>
> |   0  A100-SXM4-40GB    On            | 00000000:01:00.0      Off |          0 |
> |   1  A100-SXM4-40GB    On            | 00000000:41:00.0      Off |          0 |
> |   2  A100-SXM4-40GB    On            | 00000000:C1:00.0      Off |          0 |
> SLURM_JOB_GPUS=0,1,2
> GPU_DEVICE_ORDINAL=0,1,2
> CUDA_VISIBLE_DEVICES=0,1,2
> Nodes=tg091 CPU_IDs=0-191 Mem=360000 GRES=gpu:a100:3(IDX:0-2)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-95,128-223
>
> => IDX:0-2 does not correspond to the Bus-IDs, which would be 0, 1,
> 3 according to the non-cgroup output.
>
> Output of the 1 GPU job started second but running concurrently:
>
> |   0  A100-SXM4-40GB    On            | 00000000:81:00.0      Off |          0 |
> SLURM_JOB_GPUS=3
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
> Nodes=tg091 CPU_IDs=192-255 Mem=120000 GRES=gpu:a100:1(IDX:3)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=96-127,224-255
>
> If three jobs requesting 1, 2, and 1 GPU are submitted in that
> order, it is even worse, as the 2 GPU job will be assigned to the
> second socket while the last job fills up the first socket. It can
> clearly be seen that the IDX in GRES=gpu:a100:2(IDX:...) is just
> incremented and not related to the hardware location.
>
> |   0  A100-SXM4-40GB    On            | 00000000:41:00.0      Off |          0 |
> SLURM_JOB_GPUS=0
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
> Nodes=tg094 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
> cpuset.cpus=0-31,128-159
>
> |   0  A100-SXM4-40GB    On            | 00000000:01:00.0      Off |          0 |
> |   1  A100-SXM4-40GB    On            | 00000000:C1:00.0      Off |          0 |
> SLURM_JOB_GPUS=1,2
> GPU_DEVICE_ORDINAL=0,1
> CUDA_VISIBLE_DEVICES=0,1
> Nodes=tg094 CPU_IDs=128-255 Mem=240000 GRES=gpu:a100:2(IDX:1-2)
> cpuset.cpus=64-127,192-255
>
> |   0  A100-SXM4-40GB    On            | 00000000:81:00.0      Off |          0 |
> SLURM_JOB_GPUS=3
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
> Nodes=tg094 CPU_IDs=64-127 Mem=120000 GRES=gpu:a100:1(IDX:3)
> cpuset.cpus=32-63,160-191
>
> Best regards
>
> thomas
>
> On Fri, Feb 05, 2021 at 07:37:37PM +1100, Sean Crosby wrote:
> > Hi Thomas,
> >
> > Add the -d flag to scontrol show job, e.g.:
> >
> > # scontrol show job 23891862 -d
> > JobId=23891862 JobName=SPI_DOWN
> >    UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A
> >    Priority=586 Nice=0 Account=group1 QOS=qos1
> >    JobState=RUNNING Reason=None Dependency=(null)
> >    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> >    DerivedExitCode=0:0
> >    RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A
> >    SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28
> >    AccrueTime=2021-02-03T19:19:31
> >    StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A
> >    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31
> >    Partition=gpgpu AllocNode:Sid=spartan-login3:222306
> >    ReqNodeList=(null) ExcNodeList=(null)
> >    NodeList=spartan-gpgpu007
> >    BatchHost=spartan-gpgpu007
> >    NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
> >    TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1
> >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> >    JOB_GRES=gpu:1
> >      Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)
> >    MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0
> >    Features=(null) DelayBoot=00:00:00
> >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >
> > Note the CPU_IDs and GPU IDX in the output.
> >
> > Sean
> >
> > --
> > Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> > Research Computing Services | Business Services
> > The University of Melbourne, Victoria 3010 Australia
> >
> > On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <thomas.zei...@rrze.uni-erlangen.de> wrote:
> >
> > > UoM notice: External email. Be cautious of links, attachments, or
> > > impersonation attempts
> > >
> > > Dear All,
> > >
> > > we are running Slurm-20.02.6 and are using
> > > "SelectType=select/cons_tres" with
> > > "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
> > > and "ProctrackType=proctrack/cgroup".
> > > Nodes can be shared between multiple jobs with the partition
> > > defaults "ExclusiveUser=no OverSubscribe=No".
> > >
> > > For monitoring purposes, we'd like to know on the ControlMachine
> > > which cores of a batch node are assigned to a specific job. Is
> > > there any way (other than looking into
> > > /sys/fs/cgroup/cpuset/slurm_* on each batch node itself) to get
> > > the assigned core ranges or GPU IDs?
> > >
> > > E.g., from Torque we are used to qstat reporting the assigned
> > > cores. With Slurm, however, even "scontrol show job JOBID" does
> > > not seem to have any information in that direction.
> > >
> > > Which GPU is allocated (in the case of gres/gpu) would of course
> > > also be interesting to know on the ControlMachine.
> > >
> > > Here's the output we get from scontrol show job; it has the node
> > > name and the number of cores assigned, but not the core IDs
> > > (e.g. 32-63):
> > >
> > > JobId=886 JobName=br-14
> > >    UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
> > >    Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
> > >    JobState=RUNNING Reason=None Dependency=(null)
> > >    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> > >    RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
> > >    SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
> > >    AccrueTime=2021-02-04T07:26:51
> > >    StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
> > >    PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
> > >    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
> > >    Partition=a100 AllocNode:Sid=gpu001:1743663
> > >    ReqNodeList=(null) ExcNodeList=(null)
> > >    NodeList=gpu001
> > >    BatchHost=gpu001
> > >    NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> > >    TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
> > >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> > >    MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
> > >    Features=(null) DelayBoot=00:00:00
> > >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> > >    Command=/var/tmp/slurmd_spool/job00877/slurm_script
> > >    WorkDir=/home/hpc114/run2
> > >    StdErr=/home/hpc114//run2/br-14.o886
> > >    StdIn=/dev/null
> > >    StdOut=/home/hpc114/run2/br-14.o886
> > >    Power=
> > >    TresPerNode=gpu:a100:1
> > >    MailUser=(null) MailType=NONE
> > >
> > > Also "scontrol show node" is not helpful:
> > >
> > > NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
> > >    CPUAlloc=128 CPUTot=128 CPULoad=4.09
> > >    AvailableFeatures=hwperf
> > >    ActiveFeatures=hwperf
> > >    Gres=gpu:a100:4(S:0-1)
> > >    NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
> > >    OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
> > >    RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
> > >    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A
> > >    MCS_label=N/A
> > >    Partitions=a100
> > >    BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
> > >    CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
> > >    AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
> > >    CapWatts=n/a
> > >    CurrentWatts=0 AveWatts=0
> > >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> > >
> > > It includes no information on the four currently running jobs,
> > > nor which share of the allocated node is assigned to the
> > > individual jobs.
> > >
> > > I'd like to see somehow that job 886 got cores 32-63,160-191
> > > assigned, as seen on the node from /sys/fs/cgroup:
> > >
> > > % cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
> > > 32-63,160-191
> > >
> > > Thanks for any ideas!
> > >
> > > Thomas Zeiser
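
PS: for the original question of getting this information from the
ControlMachine, a quick-and-dirty workaround is to combine scontrol
with an ssh to the node and read the cgroup there. A rough, untested
sketch (single-node jobs only; it assumes ssh access from the
controller to the compute nodes and the cgroup v1 cpuset paths shown
in this thread):

#!/bin/bash
# Show Slurm's abstract view and the node's real cpuset for one job.
jobid="$1"
node="$(squeue -h -j "$jobid" -o %N)"    # node the job runs on (single-node jobs)
user="$(squeue -h -j "$jobid" -o %u)"    # job owner
uid="$(id -u "$user")"                   # numeric UID, as used in the cgroup path

# Slurm's view: abstract CPU_IDs and GRES IDX
scontrol show job "$jobid" -d | grep -F 'CPU_IDs='

# Node's view: the real cpuset; the prefix is "slurm" or "slurm_<node>"
# depending on the site, hence the glob
ssh "$node" "cat /sys/fs/cgroup/cpuset/slurm*/uid_${uid}/job_${jobid}/cpuset.cpus"

That covers the cores; for GPUs, the IDX presumably still has to be
mapped to physical devices via the node's gres.conf ordering.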