If you haven't looked at the man page for slurm.conf, it will answer
most if not all of your questions:
https://slurm.schedmd.com/slurm.conf.html. That said, I would rely on
the manual that was distributed with the Slurm version you have
installed, as options do change between releases.
There is a ton of information that is tedious to get through, but
reading through it multiple times opens many doors.
DefaultTime is listed in there as a Partition option.
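For example, a partition-level default can be set on the PartitionName
line in slurm.conf; this is only a sketch, and the partition name, node
list, and limits here are placeholders, not taken from your config:

    PartitionName=normal Nodes=node[01-16] Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP

With DefaultTime set, jobs submitted without --time get a one-hour
limit, which also gives the backfill scheduler something to plan with.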
If you are scheduling gres/gpu resources, it's quite possible there are
cores available with no corresponding GPUs available.
-b
On 4/24/20 2:49 PM, navin srivastava wrote:
Thanks Brian.
I need to check the job order.
Is there any way to define a default time limit for jobs when the user
does not specify one?
Also, what is the meaning of fairtree in the priority settings in the
slurm.conf file?
The sets of nodes in the partitions are different; FIFO does not seem
to care about any partitioning.
Is it strict ordering, meaning the job that came first will go first,
and until it runs no other job is allowed to start?
Also, priority is high for the GPUsmall partition and low for normal
jobs, and the nodes of the normal partition are full, but GPUsmall
cores are available.
Regards
Navin
On Fri, Apr 24, 2020, 23:49 Brian W. Johanson <bjoha...@psc.edu
<mailto:bjoha...@psc.edu>> wrote:
Without seeing the jobs in your queue, I would expect the next job
in FIFO order to be too large to fit on the currently idle resources.
Configure it to use the backfill scheduler:
SchedulerType=sched/backfill
SchedulerType
    Identifies the type of scheduler to be used. Note the slurmctld
    daemon must be restarted for a change in scheduler type to become
    effective (reconfiguring a running daemon has no effect for this
    parameter). The scontrol command can be used to manually change job
    priorities if desired. Acceptable values include:

    sched/backfill
        For a backfill scheduling module to augment the default FIFO
        scheduling. Backfill scheduling will initiate lower-priority
        jobs if doing so does not delay the expected initiation time of
        any higher priority job. Effectiveness of backfill scheduling
        is dependent upon users specifying job time limits, otherwise
        all jobs will have the same time limit and backfilling is
        impossible. Note documentation for the SchedulerParameters
        option above. This is the default configuration.

    sched/builtin
        This is the FIFO scheduler which initiates jobs in priority
        order. If any job in the partition can not be scheduled, no
        lower priority job in that partition will be scheduled. An
        exception is made for jobs that can not run due to partition
        constraints (e.g. the time limit) or down/drained nodes. In
        that case, lower priority jobs can be initiated and not impact
        the higher priority job.
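The backfill pass can also be tuned through SchedulerParameters; a
minimal sketch, with illustrative values only:

    SchedulerParameters=bf_continue,bf_window=2880,bf_max_job_test=1000

Here bf_window (in minutes) should cover your longest allowed time
limit, and bf_max_job_test caps how many pending jobs each backfill
cycle evaluates.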
Your partitions are set with MaxTime=INFINITE; if your users are not
specifying reasonable time limits on their jobs, this won't help
either.
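For example, a user can set a per-job limit at submission time (the
script name and value here are placeholders):

    sbatch --time=02:00:00 job.sh

or inside the batch script itself:

    #SBATCH --time=02:00:00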
-b
On 4/24/20 1:52 PM, navin srivastava wrote:
In addition to the above, when I look at sprio for the jobs in both
partitions it shows:
For the normal queue, all jobs show the same priority:
    JOBID    PARTITION  PRIORITY  FAIRSHARE
    1291352  normal     15789     15789
For GPUsmall, all jobs show the same priority:
    JOBID    PARTITION  PRIORITY  FAIRSHARE
    1291339  GPUsmall   21052     21053
On Fri, Apr 24, 2020 at 11:14 PM navin srivastava
<navin.alt...@gmail.com <mailto:navin.alt...@gmail.com>> wrote:
Hi Team,
We are facing an issue in our environment: resources are free, but
jobs are going into the pending (PD) state and not running.
I have attached the slurm.conf file here.
Scenario:
There are jobs in only 2 partitions:
344 jobs are in PD state in the normal partition; the nodes belonging
to the normal partition are full and no more jobs can run.
1300 jobs in the GPUsmall partition are queued; enough CPUs are
available to execute them, but I see the jobs are not being scheduled
on the free nodes.
There are no pending jobs in any other partition.
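To see why each job is pending, squeue's reason column can help; a
sketch using standard format options, with the partition name from
above:

    squeue -p GPUsmall -t PD -o "%.10i %.9P %.8T %r"

The last column shows the pending reason (e.g. Priority or Resources).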
For example, the status of node18:
NodeName=node18 Arch=x86_64 CoresPerSocket=18
CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
AvailableFeatures=K2200
ActiveFeatures=K2200
Gres=gpu:2
NodeAddr=node18 NodeHostName=node18 Version=17.11
OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50
UTC 2018 (0b375e4)
RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=GPUsmall,pm_shared
BootTime=2019-12-10T14:16:37
SlurmdStartTime=2019-12-10T14:24:08
CfgTRES=cpu=36,mem=1M,billing=36
AllocTRES=cpu=6
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
And node19:
NodeName=node19 Arch=x86_64 CoresPerSocket=18
CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
AvailableFeatures=K2200
ActiveFeatures=K2200
Gres=gpu:2
NodeAddr=node19 NodeHostName=node19 Version=17.11
OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04
UTC 2018 (3090901)
RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=GPUsmall,pm_shared
BootTime=2020-03-12T06:51:54
SlurmdStartTime=2020-03-12T06:53:14
CfgTRES=cpu=36,mem=1M,billing=36
AllocTRES=cpu=16
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Could you please help me understand what the reason could be?