On Mon, Apr 16, 2018 at 6:35 AM, <marius.cetate...@sony.com> wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and
> Memory configured as resources. I have 56 CPUs and 256GB of RAM in my
> resource pool. I would expect that the backfill scheduler attempts to
> allocate the resources in order to fill as many of the cores as possible if
> there are multiple processes asking for more resources than available. In
> my case I have the following queue:
>
> I'm going through the documentation again and again but I cannot figure out
> what I am doing wrong ... Why do I have the above situation? What should I
> change in my config to make this work?
>
> scontrol show -dd job <jobid> shows me the following:
>
> JobId=2361 JobName=training_carlib
>    UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
>    Priority=4294901726 Nice=0 Account=(null) QOS=(null)
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
>    SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
>    StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=main_compute AllocNode:Sid=zalmoxis:23690
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=cn_burebista
>    NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=20,node=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
>    WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
>    StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
>    StdIn=/dev/null
>    StdOut=/home/mcetateanu/workspace/CarLib/src/_out
Perhaps I missed something in the email, but it sounds like you have 56 cores and two running jobs that consume 52 of them, leaving four free. A third job then came along and requested 20 cores (based on the show job output). Slurm doesn't overcommit resources, so a 20-CPU job will not fit while only four CPUs are free.
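
If it helps to confirm that arithmetic on your side, something along these lines should show the allocated/idle CPU counts per node and the CPU request of each running job (cn_burebista is just the node name taken from your SchedNodeList; substitute your own):

  # CPUs per node as allocated/idle/other/total
  sinfo -N -o "%n %C"

  # CPU count requested by each running job
  squeue -t RUNNING -o "%.10i %.12u %.5C %R"

  # detailed per-node view (compare CPUAlloc against CPUTot)
  scontrol show node cn_burebista

If the idle count there is less than 20, the pending job's Reason=Resources is exactly what you'd expect; backfill can only start it early if it fits in the gap without delaying higher-priority jobs.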