Thanks for your suggestions, Marcus. I have restarted the services and have also been tweaking various parameters (probably more than I should). Nothing seems to help.
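For anyone trying to reproduce this, the sequence is roughly the following. Treat it as a sketch: it assumes the standard systemd unit names, and the srun flags are inferred from the scontrol output quoted below (the CPU and memory values there may simply be partition defaults rather than explicit requests).

  # Restart the daemons (assumes systemd units named slurmctld/slurmd)
  $ sudo systemctl restart slurmctld    # on the controller
  $ sudo systemctl restart slurmd       # on computelab-134
  $ sudo scontrol reconfigure

  # Interactive test jobs, one per GPU type, pinned to the node
  # (flags inferred from the job records quoted below, so only approximate)
  $ srun -p test-backfill -w computelab-134 --gres=gpu:gv100:1 --pty /bin/bash
  $ srun -p test-backfill -w computelab-134 --gres=gpu:tu104:1 --pty /bin/bash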
Not ready to upgrade to Slurm 18, so I guess I'll have to live with it...

Best,
Randy

On Wed, Apr 3, 2019 at 1:50 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
> Hmm...,
>
> I'm a bit dazzled; it seems to be OK as far as I can tell.
>
> Did you try to restart slurmctld?
> I had a case where users could not submit to the default partition
> anymore, since SLURM told them (if I remember right)
> "wrong account/partition combination"
> or something like that.
> My first suspicion was my submission script, since I had changed it recently,
> but I could not find any error. scontrol reconfig did not help.
> But everything went well again after I restarted the slurmctld.
>
> Might be worth a try.
>
> Best
> Marcus
>
> On 4/2/19 1:24 PM, Randall Radmer wrote:
>
> Hi Marcus,
>
> The following jobs are running or pending after I killed job 100816, which was
> running on computelab-134's T4:
>
> 100815 RUNNING computelab-134 gpu:gv100:1 None1
> 100817 PENDING gpu:gv100:1 Resources1
> 100818 PENDING gpu:tu104:1 Resources1
>
> $ scontrol -d show node computelab-134
> NodeName=computelab-134 Arch=x86_64 CoresPerSocket=6
> CPUAlloc=6 CPUErr=0 CPUTot=12 CPULoad=0.00
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:gv100:1,gpu:tu104:1
> GresDrain=N/A
> GresUsed=gpu:gv100:1(IDX:0),gpu:tu104:0(IDX:N/A)
> NodeAddr=computelab-134 NodeHostName=computelab-134 Version=17.11
> OS=Linux 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019
> RealMemory=64307 AllocMem=32148 FreeMem=61126 Sockets=2 Boards=1
> State=MIXED ThreadsPerCore=1 TmpDisk=404938 Weight=1 Owner=N/A
> MCS_label=N/A
> Partitions=test-backfill
> BootTime=2019-03-29T12:09:25 SlurmdStartTime=2019-04-01T11:34:35
> CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
> AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> $ scontrol -d show job 100815
> JobId=100815 JobName=bash
> UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
> Priority=1 Nice=0 Account=cag QOS=normal
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:06:45 TimeLimit=02:00:00 TimeMin=N/A
> SubmitTime=2019-04-02T05:13:05 EligibleTime=2019-04-02T05:13:05
> StartTime=2019-04-02T05:13:05 EndTime=2019-04-02T07:13:05 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-02T05:13:05
> Partition=test-backfill AllocNode:Sid=computelab-frontend-02:7873
> ReqNodeList=computelab-134 ExcNodeList=(null)
> NodeList=computelab-134
> BatchHost=computelab-134
> NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
> TRES=cpu=6,mem=32148M,node=1,billing=6,gres/gpu=1,gres/gpu:gv100=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=computelab-134 CPU_IDs=0-5 Mem=32148 GRES_IDX=gpu:gv100(IDX:0)
> MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=gpu:gv100:1 Reservation=(null)
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/bin/bash
> WorkDir=/home/rradmer
> Power=
>
> $ scontrol -d show job 100817
> JobId=100817 JobName=bash
> UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
> Priority=1 Nice=0 Account=cag QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
> SubmitTime=2019-04-02T05:13:11 EligibleTime=2019-04-02T05:13:11
> StartTime=2019-04-02T07:13:05 EndTime=2019-04-02T09:13:05 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-02T05:20:44
> Partition=test-backfill AllocNode:Sid=computelab-frontend-03:21736
> ReqNodeList=computelab-134 ExcNodeList=(null)
> NodeList=(null) SchedNodeList=computelab-134
> NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
> TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:gv100=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=gpu:gv100:1 Reservation=(null)
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/bin/bash
> WorkDir=/home/rradmer
> Power=
>
> $ scontrol -d show job 100818
> JobId=100818 JobName=bash
> UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
> Priority=1 Nice=0 Account=cag QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
> SubmitTime=2019-04-02T05:13:12 EligibleTime=2019-04-02T05:13:12
> StartTime=2019-04-02T09:13:00 EndTime=2019-04-02T11:13:00 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-02T05:21:32
> Partition=test-backfill AllocNode:Sid=computelab-frontend-02:12826
> ReqNodeList=computelab-134 ExcNodeList=(null)
> NodeList=(null) SchedNodeList=computelab-134
> NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
> TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:tu104=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=gpu:tu104:1 Reservation=(null)
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/bin/bash
> WorkDir=/home/rradmer
> Power=
>
>
> On Mon, Apr 1, 2019 at 11:24 PM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
>
>> Dear Randall,
>>
>> Could you please also provide
>>
>> scontrol -d show node computelab-134
>> scontrol -d show job 100091
>> scontrol -d show job 100094
>>
>> Best
>> Marcus
>>
>> On 4/1/19 4:31 PM, Randall Radmer wrote:
>>
>> I can’t get backfill to work for a machine with two GPUs (one is a P4 and
>> the other a T4).
>>
>> Submitting jobs works as expected: if the GPU I request is free, then my
>> job runs, otherwise it goes into a pending state. But if I have pending
>> jobs for one GPU ahead of pending jobs for the other GPU, I see blocking
>> issues.
>>
>> More specifically, I can create a case where I am running a job on each
>> of the GPUs and have a pending job waiting for the P4, followed by a pending
>> job waiting for the T4. I would expect that if I exit the running T4 job,
>> then backfill would start the pending T4 job, even though it has to jump
>> ahead of the pending P4 job. This does not happen...
>>
>> The following shows my jobs after I exited from a running T4 job, which
>> had ID 100092:
>>
>> $ squeue --noheader -u rradmer --Format=jobid,state,gres,nodelist,reason | sed 's/ */ /g' | sort
>> 100091 RUNNING gpu:gv100:1 computelab-134 None
>> 100093 PENDING gpu:gv100:1 Resources
>> 100094 PENDING gpu:tu104:1 Resources
>>
>> I can find no reason why 100094 doesn’t start running (I’ve waited up to
>> an hour, just to make sure).
>>
>> System config info and log snippets are shown below.
>>
>> Thanks much,
>> Randy
>>
>> Node state corresponding to the squeue command, shown above:
>>
>> $ scontrol show node computelab-134 | grep -i [gt]res
>> Gres=gpu:gv100:1,gpu:tu104:1
>> CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
>> AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
>>
>> Slurm config follows:
>>
>> $ scontrol show conf | grep -Ei '(gres|^Sched|prio|vers)'
>> AccountingStorageTRES = cpu,mem,energy,node,billing,gres/gpu,gres/gpu:gp100,gres/gpu:gp104,gres/gpu:gv100,gres/gpu:tu102,gres/gpu:tu104,gres/gpu:tu106
>> GresTypes = gpu
>> PriorityParameters = (null)
>> PriorityDecayHalfLife = 7-00:00:00
>> PriorityCalcPeriod = 00:05:00
>> PriorityFavorSmall = No
>> PriorityFlags =
>> PriorityMaxAge = 7-00:00:00
>> PriorityUsageResetPeriod = NONE
>> PriorityType = priority/multifactor
>> PriorityWeightAge = 0
>> PriorityWeightFairShare = 0
>> PriorityWeightJobSize = 0
>> PriorityWeightPartition = 0
>> PriorityWeightQOS = 0
>> PriorityWeightTRES = (null)
>> PropagatePrioProcess = 0
>> SchedulerParameters = default_queue_depth=2000,bf_continue,bf_ignore_newly_avail_nodes,bf_max_job_test=1000,bf_window=10080,kill_invalid_depend
>> SchedulerTimeSlice = 30 sec
>> SchedulerType = sched/backfill
>> SLURM_VERSION = 17.11.9-2
>>
>> GPUs on node:
>>
>> $ nvidia-smi --query-gpu=index,name,gpu_bus_id --format=csv
>> index, name, pci.bus_id
>> 0, Tesla T4, 00000000:82:00.0
>> 1, Tesla P4, 00000000:83:00.0
>>
>> The gres file on node:
>>
>> $ cat /etc/slurm/gres.conf
>> Name=gpu Type=tu104 File=/dev/nvidia0 Cores=0,1,2,3,4,5
>> Name=gpu Type=gp104 File=/dev/nvidia1 Cores=6,7,8,9,10,11
>>
>> Random sample of SlurmSchedLogFile:
>>
>> $ sudo tail -3 slurm.sched.log
>> [2019-04-01T08:14:23.727] sched: Running job scheduler
>> [2019-04-01T08:14:23.728] sched: JobId=100093. State=PENDING. Reason=Resources. Priority=1. Partition=test-backfill.
>> [2019-04-01T08:14:23.728] sched: JobId=100094. State=PENDING. Reason=Resources. Priority=1. Partition=test-backfill.
>>
>> Random sample of SlurmctldLogFile:
>>
>> $ sudo grep backfill slurmctld.log | tail -5
>> [2019-04-01T08:16:53.281] backfill: beginning
>> [2019-04-01T08:16:53.281] backfill test for JobID=100093 Prio=1 Partition=test-backfill
>> [2019-04-01T08:16:53.281] backfill test for JobID=100094 Prio=1 Partition=test-backfill
>> [2019-04-01T08:16:53.281] backfill: reached end of job queue
>> [2019-04-01T08:16:53.281] backfill: completed testing 2(2) jobs, usec=707
>>
>> --
>> Marcus Wagner, Dipl.-Inf.
>>
>> IT Center
>> Abteilung: Systeme und Betrieb
>> RWTH Aachen University
>> Seffenter Weg 23
>> 52074 Aachen
>> Tel: +49 241 80-24383
>> Fax: +49 241 80-624383
>> wag...@itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de