Dear Randall,

Could you please also provide the output of the following commands:


scontrol -d show node computelab-134
scontrol -d show job 100091
scontrol -d show job 100094


Best
Marcus

On 4/1/19 4:31 PM, Randall Radmer wrote:

I can’t get backfill to work for a machine with two GPUs (one is a P4 and the other a T4).

Submitting jobs works as expected: if the GPU I request is free, my job runs; otherwise it goes into a pending state.  But if pending jobs for one GPU are queued ahead of pending jobs for the other GPU, the later jobs stay blocked even when their GPU becomes free.


More specifically, I can create a case where I am running a job on each of the GPUs and have a pending job waiting for the P4 followed by a pending job waiting for the T4.  I would expect that if I exit the running T4 job, then backfill would start the pending T4 job, even though it has to jump ahead of the pending P4 job. This does not happen...
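
(The exact submit lines aren't shown here; roughly, each job is an sbatch that asks for one GPU of a specific type. A sketch of the sequence, with job.sh standing in for my actual script and the type names matching what squeue reports below:)

$ sbatch --partition=test-backfill --gres=gpu:gv100:1 job.sh   # starts on the gv100-typed GPU
$ sbatch --partition=test-backfill --gres=gpu:tu104:1 job.sh   # starts on the tu104-typed GPU (the T4)
$ sbatch --partition=test-backfill --gres=gpu:gv100:1 job.sh   # pends behind the first job
$ sbatch --partition=test-backfill --gres=gpu:tu104:1 job.sh   # pends behind the second job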


The following shows my jobs after I exited from a running T4 job, which had ID 100092:

$ squeue --noheader -u rradmer --Format=jobid,state,gres,nodelist,reason | sed 's/  */ /g' | sort

100091 RUNNING gpu:gv100:1 computelab-134 None

100093 PENDING gpu:gv100:1 Resources

100094 PENDING gpu:tu104:1 Resources


I can find no reason why 100094 doesn’t start running (I’ve waited up to an hour, just to make sure).


System config info and log snippets shown below.


Thanks much,

Randy


Node state corresponding to the squeue output shown above:

$ scontrol show node computelab-134 | grep -i '[gt]res'

  Gres=gpu:gv100:1,gpu:tu104:1

  CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1

  AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1



Slurm config follows:

$ scontrol show conf | grep -Ei '(gres|^Sched|prio|vers)'

AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu,gres/gpu:gp100,gres/gpu:gp104,gres/gpu:gv100,gres/gpu:tu102,gres/gpu:tu104,gres/gpu:tu106

GresTypes               = gpu

PriorityParameters      = (null)

PriorityDecayHalfLife   = 7-00:00:00

PriorityCalcPeriod      = 00:05:00

PriorityFavorSmall      = No

PriorityFlags           =

PriorityMaxAge          = 7-00:00:00

PriorityUsageResetPeriod = NONE

PriorityType            = priority/multifactor

PriorityWeightAge       = 0

PriorityWeightFairShare = 0

PriorityWeightJobSize   = 0

PriorityWeightPartition = 0

PriorityWeightQOS       = 0

PriorityWeightTRES      = (null)

PropagatePrioProcess    = 0

SchedulerParameters     = default_queue_depth=2000,bf_continue,bf_ignore_newly_avail_nodes,bf_max_job_test=1000,bf_window=10080,kill_invalid_depend

SchedulerTimeSlice      = 30 sec

SchedulerType           = sched/backfill

SLURM_VERSION           = 17.11.9-2


GPUs on node:

$ nvidia-smi --query-gpu=index,name,gpu_bus_id --format=csv

index, name, pci.bus_id

0, Tesla T4, 00000000:82:00.0

1, Tesla P4, 00000000:83:00.0

The gres.conf file on the node:

$ cat /etc/slurm/gres.conf

Name=gpu Type=tu104 File=/dev/nvidia0 Cores=0,1,2,3,4,5

Name=gpu Type=gp104 File=/dev/nvidia1 Cores=6,7,8,9,10,11


Last few lines of the SlurmSchedLogFile:

$ sudo tail -3 slurm.sched.log

[2019-04-01T08:14:23.727] sched: Running job scheduler

[2019-04-01T08:14:23.728] sched: JobId=100093. State=PENDING. Reason=Resources. Priority=1. Partition=test-backfill.

[2019-04-01T08:14:23.728] sched: JobId=100094. State=PENDING. Reason=Resources. Priority=1. Partition=test-backfill.


Recent backfill entries from the SlurmctldLogFile:

$ sudo grep backfill slurmctld.log  | tail -5

[2019-04-01T08:16:53.281] backfill: beginning

[2019-04-01T08:16:53.281] backfill test for JobID=100093 Prio=1 Partition=test-backfill

[2019-04-01T08:16:53.281] backfill test for JobID=100094 Prio=1 Partition=test-backfill

[2019-04-01T08:16:53.281] backfill: reached end of job queue

[2019-04-01T08:16:53.281] backfill: completed testing 2(2) jobs, usec=707
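
If more scheduler detail would help, I believe backfill-specific debugging can be switched on at runtime (a sketch only; I have not captured this yet):

$ sudo scontrol setdebugflags +Backfill   # adds detailed backfill traces to slurmctld.log

$ sudo scontrol setdebugflags -Backfill   # turns the extra logging back off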


--
Marcus Wagner, Dipl.-Inf.

IT Center
Department: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
