I don't have access to a cluster right now so I can't test this, but possibly tres_alloc might give some more info:

  squeue -O JobID,Partition,Name,tres_alloc,NodeList -j <job>
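Untested, but a concrete call could look something like the following (the exact --Format field name for the allocated TRES may be spelled tres-alloc rather than tres_alloc, depending on the Slurm version):

  squeue --Format=JobID,Partition,Name,tres-alloc,NodeList -j 886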
On 04/02/2021 17:01, Thomas Zeiser wrote:
Dear All,

we are running Slurm-20.02.6 and using "SelectType=select/cons_tres" with "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup", and "ProctrackType=proctrack/cgroup". Nodes can be shared between multiple jobs with the partition defaults "ExclusiveUser=no OverSubscribe=No".

For monitoring purposes, we'd like to know on the ControlMachine which cores of a batch node are assigned to a specific job. Is there any way (except looking on each batch node itself into /sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or GPU IDs? E.g. from Torque we are used to qstat telling us the assigned cores. However, with Slurm, even "scontrol show job JOBID" does not seem to have any information in that direction. Knowing which GPU is allocated (in the case of gres/gpu) would of course also be interesting to know on the ControlMachine.

Here's the output we get from scontrol show job; it has the node name and the number of cores assigned, but not the "core IDs" (e.g. 32-63):

  JobId=886 JobName=br-14
     UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
     Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
     JobState=RUNNING Reason=None Dependency=(null)
     Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
     RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
     SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
     AccrueTime=2021-02-04T07:26:51
     StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
     PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
     SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
     Partition=a100 AllocNode:Sid=gpu001:1743663
     ReqNodeList=(null) ExcNodeList=(null)
     NodeList=gpu001
     BatchHost=gpu001
     NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
     Features=(null) DelayBoot=00:00:00
     OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
     Command=/var/tmp/slurmd_spool/job00877/slurm_script
     WorkDir=/home/hpc114/run2
     StdErr=/home/hpc114//run2/br-14.o886
     StdIn=/dev/null
     StdOut=/home/hpc114/run2/br-14.o886
     Power=
     TresPerNode=gpu:a100:1
     MailUser=(null) MailType=NONE

Also "scontrol show node" is not helpful:

  NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
     CPUAlloc=128 CPUTot=128 CPULoad=4.09
     AvailableFeatures=hwperf
     ActiveFeatures=hwperf
     Gres=gpu:a100:4(S:0-1)
     NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
     OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
     RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
     State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A
     Partitions=a100
     BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
     CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
     AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
     CapWatts=n/a
     CurrentWatts=0 AveWatts=0
     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

There is no information on the currently running four jobs included, nor which share of the allocated node is assigned to the individual jobs. I'd like to see somehow that job 886 got cores 32-63,160-191 assigned, as seen on the node from /sys/fs/cgroup:

  %cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
  32-63,160-191

Thanks for any ideas!

Thomas Zeiser
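Until there is a better answer, one workaround could be to read that same cpuset from the ControlMachine over ssh. A minimal sketch, assuming passwordless ssh to the compute nodes and the cgroup path layout shown in the quoted mail (the squeue field names and the tr cleanup of the padded output are my assumptions, untested here):

  # hypothetical helper: show the cpuset of a running job from the ControlMachine
  jobid=886
  node=$(squeue -h -O nodelist -j "$jobid" | tr -d ' ')    # node the job runs on
  user=$(squeue -h -O username -j "$jobid" | tr -d ' ')    # owning user
  uid=$(id -u "$user")                                     # numeric uid used in the cgroup path
  ssh "$node" cat "/sys/fs/cgroup/cpuset/slurm_${node}/uid_${uid}/job_${jobid}/cpuset.cpus"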