Without seeing the jobs in your queue, I would expect that the next job in
FIFO order is too large to fit in the currently idle resources.
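One quick way to check (a sketch, assuming a reasonably recent squeue) is
to look at the priority and pending reason of the waiting jobs:

    squeue --state=PD -o "%.12i %.10P %.10Q %.25r" | head

Jobs waiting behind a larger job will typically show a reason of Resources
or Priority.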
Configure it to use the backfill scheduler: SchedulerType=sched/backfill
(a minimal configuration sketch follows the man page excerpt below).

From the slurm.conf(5) man page:
SchedulerType
    Identifies the type of scheduler to be used. Note the slurmctld
    daemon must be restarted for a change in scheduler type to become
    effective (reconfiguring a running daemon has no effect for this
    parameter). The scontrol command can be used to manually change job
    priorities if desired. Acceptable values include:

    sched/backfill
        For a backfill scheduling module to augment the default FIFO
        scheduling. Backfill scheduling will initiate lower-priority
        jobs if doing so does not delay the expected initiation time of
        any higher priority job. Effectiveness of backfill scheduling
        is dependent upon users specifying job time limits, otherwise
        all jobs will have the same time limit and backfilling is
        impossible. Note documentation for the SchedulerParameters
        option above. This is the default configuration.

    sched/builtin
        This is the FIFO scheduler which initiates jobs in priority
        order. If any job in the partition can not be scheduled, no
        lower priority job in that partition will be scheduled. An
        exception is made for jobs that can not run due to partition
        constraints (e.g. the time limit) or down/drained nodes. In
        that case, lower priority jobs can be initiated and not impact
        the higher priority job.
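A minimal sketch of the change (the SchedulerParameters line is optional
and its values are only illustrative; see slurm.conf(5) and tune them for
your site):

    # slurm.conf
    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=1440,bf_max_job_test=500

    # as noted above, a reconfigure is not enough for this parameter;
    # restart the controller (systemd example)
    systemctl restart slurmctld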
Your partitions are set with MaxTime=INFINITE; if your users are not
specifying a reasonable time limit on their jobs, this won't help either.
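As a rough sketch (the time values below are only placeholders), users can
pass an explicit limit at submit time, and/or you can set a partition
default and maximum in slurm.conf so that jobs submitted without a limit
do not claim INFINITE:

    sbatch --time=04:00:00 job.sh

    PartitionName=GPUsmall Nodes=... DefaultTime=04:00:00 MaxTime=7-00:00:00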
-b
On 4/24/20 1:52 PM, navin srivastava wrote:
In addition to the above, when I look at sprio for both jobs it shows:
For the normal queue, all jobs show the same priority:
JOBID PARTITION PRIORITY FAIRSHARE
1291352 normal 15789 15789
For GPUsmall, all jobs show the same priority:
JOBID PARTITION PRIORITY FAIRSHARE
1291339 GPUsmall 21052 21053
On Fri, Apr 24, 2020 at 11:14 PM navin srivastava
<navin.alt...@gmail.com> wrote:
Hi Team,
We are facing an issue in our environment: resources are free, but jobs
are going into the queue (PD state) and not running.
I have attached the slurm.conf file here.
Scenario:
There are jobs in only 2 partitions:
344 jobs are in PD state in the normal partition; the nodes belonging
to the normal partition are full, so no more jobs can run there.
1300 jobs in the GPUsmall partition are queued; enough CPUs are
available to execute the jobs, but I see the jobs are not being
scheduled on the free nodes.
There are no pending jobs in any other partition.
e.g.:
node status: node18
NodeName=node18 Arch=x86_64 CoresPerSocket=18
CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
AvailableFeatures=K2200
ActiveFeatures=K2200
Gres=gpu:2
NodeAddr=node18 NodeHostName=node18 Version=17.11
OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC
2018 (0b375e4)
RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=GPUsmall,pm_shared
BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
CfgTRES=cpu=36,mem=1M,billing=36
AllocTRES=cpu=6
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
node19:
NodeName=node19 Arch=x86_64 CoresPerSocket=18
CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
AvailableFeatures=K2200
ActiveFeatures=K2200
Gres=gpu:2
NodeAddr=node19 NodeHostName=node19 Version=17.11
OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC
2018 (3090901)
RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=GPUsmall,pm_shared
BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
CfgTRES=cpu=36,mem=1M,billing=36
AllocTRES=cpu=16
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Could you please help me understand what the reason could be?