On Mon, Apr 16, 2018 at 6:35 AM, <marius.cetate...@sony.com> wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and
> Memory configured as resources. I have 56 CPUs and 256GB of RAM in my
> resource pool. I would expect that the backfill scheduler attempts to
> allocate the resources in order to fill as many of the cores as possible if
> there are multiple processes asking for more resources than available. In
> my case I have the following queue:
>
> I'm going through the documentation again and again but I cannot figure out
> what I am doing wrong ... Why do I have the above situation? What should I
> change in my config to make this work?
>
> scontrol show -dd job <jobid> shows me the following:
>
> JobId=2361 JobName=training_carlib
>    UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
>    Priority=4294901726 Nice=0 Account=(null) QOS=(null)
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
>    SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
>    StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=main_compute AllocNode:Sid=zalmoxis:23690
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=cn_burebista
>    NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=20,node=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
>    WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
>    StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
>    StdIn=/dev/null
>    StdOut=/home/mcetateanu/workspace/CarLib/src/_out
Perhaps I missed something in the email, but it sounds like you have 56 cores and two running jobs that consume 52 of them, leaving four free. A third job then came along and requested 20 cores (based on the show job output). Slurm doesn't overcommit resources, so a 20-CPU job will not fit while only four CPUs are free.
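
If it helps to confirm that arithmetic on your side, something along these lines should show the allocated/idle CPU counts per node and the CPU request of each running job (cn_burebista is just the node name taken from your SchedNodeList; substitute your own):

  # CPUs per node as allocated/idle/other/total
  sinfo -N -o "%n %C"

  # CPU count requested by each running job
  squeue -t RUNNING -o "%.10i %.12u %.5C %R"

  # detailed per-node view (compare CPUAlloc against CPUTot)
  scontrol show node cn_burebista

If the idle count there is less than 20, the pending job's Reason=Resources is exactly what you'd expect; backfill can only start it early if it fits in the gap without delaying higher-priority jobs.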