Dear Thekla,

Thekla Loizou <t.loi...@cyi.ac.cy> writes:

> Dear Loris,
>
> There is no specific node required for this array.  I can verify that
> from "scontrol show job 124841", since the requested node list is
> empty:
>
>   ReqNodeList=(null)
>
> Also, all 17 nodes of the cluster are identical, so all nodes fulfil
> the job requirements, not only node cn06.
>
> By "saving" the other nodes I mean that the scheduler estimates that
> the array jobs will start on 2021-12-11T03:58:00, and no other jobs
> are scheduled to run during that time on the other nodes.  So it
> seems that the scheduler does somehow plan the array jobs across more
> than one node, but this does not show up in the squeue or scontrol
> output.

My guess is that there is something wrong with either the job
configuration or the node configuration, if Slurm thinks that 9 jobs,
each of which requires a whole node, can all be started simultaneously
on the same node.
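As a sanity check, it might be worth comparing what one array task
requests with what cn06 and the partition actually allow, e.g. along
the following lines (the grep patterns are just illustrative, and the
field names may differ slightly between Slurm versions):

  # What a single array task requests
  scontrol show job 124841 | grep -E 'NumNodes|NumCPUs|TRES|OverSubscribe'

  # What the node can provide
  scontrol show node cn06 | grep -E 'CPUTot|RealMemory|State'

  # Whether the partition allows jobs to share nodes
  scontrol show partition cpu | grep -i 'OverSubscribe'

If the partition has OverSubscribe set to anything other than NO, the
scheduler packing several of these jobs onto one node would at least
be consistent with what you are seeing.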
Cheers,

Loris

> Regards,
>
> Thekla
>
> On 7/12/21 12:16 p.m., Loris Bennett wrote:
>> Hi Thekla,
>>
>> Thekla Loizou <t.loi...@cyi.ac.cy> writes:
>>
>>> Dear all,
>>>
>>> I have noticed that Slurm schedules several jobs from a job array
>>> on the same node with the same start time and end time.
>>>
>>> Each of these jobs requires the full node.  You can see the squeue
>>> output below:
>>>
>>>      JOBID PARTITION ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
>>>   124841_1       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_2       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_3       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_4       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_5       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_6       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_7       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_8       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>   124841_9       cpu PD 2021-12-11T03:58:00     1       cn06 (Priority)
>>>
>>> Is this a bug, or am I missing something?  Is this because the jobs
>>> have the same JOBID and are still in the pending state?  I am aware
>>> that the jobs will not actually all run on the same node at the
>>> same time, and that the scheduler somehow takes into account that
>>> this job array has 9 jobs that will need 9 nodes.  I am creating a
>>> timeline with the start times of all jobs, and when the array jobs
>>> are due to start, no other jobs are set to run on the remaining
>>> nodes (so it "saves" the other nodes for the jobs of the array,
>>> even though squeue and scontrol show them all scheduled on the same
>>> node).
>>
>> In general, jobs from an array will be scheduled on whatever nodes
>> fulfil their requirements.  The fact that all the jobs have
>>
>>   cn06
>>
>> as SCHEDNODES, however, seems to suggest that you have either
>> specified cn06 as the node the jobs should run on, or that cn06 is
>> the only node which fulfils the job requirements.
>>
>> I'm not sure what you mean by '"saving" the other nodes'.
>>
>> Cheers,
>>
>> Loris
>>
--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin           Email loris.benn...@fu-berlin.de