Hi Ken,

Here is my slurm.conf:
ControlMachine=s19r2b08
AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey=/home/bsc33/bsc33882/slurm_over_slurm/etc/slurm.key
JobCredentialPublicCertificate=/home/bsc33/bsc33882/slurm_over_slurm/etc/slurm.cert
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/home/bsc33/bsc33882/slurm_over_slurm/var/run/slurmctld.pid
SlurmctldPort=7001
SlurmdPidFile=/home/bsc33/bsc33882/slurm_over_slurm/var/run/slurmd.%n.pid
SlurmdPort=8009
SlurmdSpoolDir=/home/bsc33/bsc33882/slurm_over_slurm/var/spool/slurmd.%n
SlurmUser=bsc33882
SlurmdUser=bsc33882
StateSaveLocation=/home/bsc33/bsc33882/slurm_over_slurm/var/state
SwitchType=switch/none
TaskPlugin=task/none
TaskPluginParam=autobind=cores
# TIMERS
InactiveLimit=1800
KillWait=60
MinJobAge=300
OverTimeLimit=1
SlurmctldTimeout=300
SlurmdTimeout=300
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
SchedulerParameters=bf_interval=30,default_queue_depth=50
# LOGGING AND ACCOUNTING
ClusterName=cluster
JobCompType=jobcomp/script
JobCompLoc=/home/bsc33/bsc33882/slurm_over_slurm/script/trace.sh
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=7
SlurmctldLogFile=/home/bsc33/bsc33882/slurm_over_slurm/var/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/home/bsc33/bsc33882/slurm_over_slurm/var/slurmd.%n.log
DebugFlags=Backfill,SelectType
# COMPUTE NODES
NodeName=s19r2b[09-10,12,14,16] CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 State=IDLE Port=7009
PartitionName=debug Nodes=s19r2b[09-10,12,14,16] Default=YES MaxTime=INFINITE State=UP

On Tue, 4 Dec 2018 at 04:59, Kenneth Roberts <krobe...@materialsdesign.com> wrote:

> Hi –
>
> The time stamps show that your 1st sbatch job components start at the
> same time and then run for 1 minute.
>
> 30 seconds after the simultaneous end of all three components of the 1st
> sbatch, the two components of the 3rd sbatch and the three components of
> the 2nd all start. The two components of the 3rd batch each run for 20
> seconds. The three components of the 2nd sbatch all run for 1 minute.
>
> The 3rd sbatch start was delayed 5 seconds by the sleep, so they didn’t
> start with the 1st batch.
>
> Are you able to give the other parameters of your setup? The SelectType?
> The node specs? These will affect scheduling. Note, I’m wading into deep
> waters for me ... still learning slurm. (slurming? ;-)
>
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Ana Jokanovic
> *Sent:* Monday, December 3, 2018 12:40 AM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* Re: [slurm-users] backfill scheduler does not work for
> heterogeneous jobs (version 17.11)
>
> Hi Ken,
>
> I have read this page and I understood that in case of my example the
> third job should be backfilled. The second job can start after 15 minutes,
> but the third job requires only two nodes and 2 minutes, thus it can start
> immediately, but this does not happen.
>
> In the page that you referred to, they give an example:
>
> For example, consider a heterogeneous job with three components. When
> considered as independent jobs, the components could be initiated at times
> now (component 0), now plus 2 hour (component 1), and now plus 1 hours
> (component 2). When the backfill scheduler runs in the first mode:
>
> 1. Component 0 will be noted to possible to start now, but not
>    initiated due to the additional components to be initiated
>
> 2. Component 1 will be noted to be possible to start in 2 hours
>
> 3. Component 2 will not be considered for scheduling until 2 hours in
>    the future, *which leave some additional resources available for
>    scheduling to other jobs*
>
> When the backfill scheduler executes next, it will use the second mode and
> (assuming no other state changes) all three job components will be
> considered available for scheduling no earlier than 2 hours in the future,
> *which may allow other jobs to be allocated resources before heterogeneous
> job component 0 could be initiated.*
>
> From this example, I understand that in my experiment the third job should
> be backfilled. The second job can start after 15 minutes, but the third job
> requires only two nodes and 2 minutes, thus it can start immediately, but
> this does not happen.
>
> It seems there is a bug here. I also tried with the version 18.03, but it
> does not work either.
>
> Ana
>
> On Fri, 30 Nov 2018 at 17:46, Kenneth Roberts <krobe...@materialsdesign.com> wrote:
>
> There are some Limitations that mention backfill on the heterogeneous job
> support page.
>
> https://slurm.schedmd.com/heterogeneous_jobs.html#limitations
>
> Maybe there’s some information there to help?
>
> Ken
>
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Ana Jokanovic
> *Sent:* Thursday, November 29, 2018 4:28 AM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] backfill scheduler does not work for
> heterogeneous jobs (version 17.11)
>
> Hello,
>
> I did a simple test submitting the workload of three jobs (see below) on a
> cluster of 5 nodes:
>
> sbatch --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15
>
> sbatch --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15 : --cpus-per-task=2 --ntasks=6 --time=15
>
> sleep 5;
>
> sbatch --ntasks=1 --time=2 : --ntasks=1 --time=1
>
> I would expect that the third submitted job is backfilled but it does not
> happen.
>
> Here is the job completion log:
>
> JobId=2 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b09 NodeCnt=1 ProcCnt=48
>
> JobId=3 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b10 NodeCnt=1 ProcCnt=48
>
> JobId=4 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b12 NodeCnt=1 ProcCnt=48
>
> JobId=8 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:02:00 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b14 NodeCnt=1 ProcCnt=48
>
> JobId=9 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:01:00 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b16 NodeCnt=1 ProcCnt=48
>
> JobId=5 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b09 NodeCnt=1 ProcCnt=48
>
> JobId=6 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b10 NodeCnt=1 ProcCnt=48
>
> JobId=7 UserId=3113 GroupId=8950 Name=sleep JobState=COMPLETED Partition=debug TimeLimit=00:15:00 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b12 NodeCnt=1 ProcCnt=48
>
> Would you expect this behavior?
>
> Thanks.
>
> Best regards,
>
> Ana
>
> --
> Ana Jokanovic, PhD
> Barcelona Supercomputing Center
> c/ Jordi Girona 1-3, K2M Building, 1st floor
> 08034 Barcelona - SPAIN
> e-mail: ana...@gmail.com or ana.jokano...@bsc.es
> tel: +34 93 4137246

--
Ana Jokanovic, PhD
Barcelona Supercomputing Center
c/ Jordi Girona 1-3, K2M Building, 1st floor
08034 Barcelona - SPAIN
e-mail: ana...@gmail.com or ana.jokano...@bsc.es
tel: +34 93 4137246
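
For reference, the timeline discussed in this thread can be checked directly from the epoch timestamps in the job completion log. Below is a minimal Python sketch, assuming the records are cut down to the fields needed here. The helper names parse_record and timeline are made up for illustration and are not part of Slurm.

```python
import re

# Records copied from the job completion log quoted in this thread,
# reduced to the fields needed for the timeline.
LOG = """\
JobId=2 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b09
JobId=3 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b10
JobId=4 SubmitTime=1543317694 StartTime=1543317714 EndTime=1543317774 NodeList=s19r2b12
JobId=8 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b14
JobId=9 SubmitTime=1543317699 StartTime=1543317804 EndTime=1543317824 NodeList=s19r2b16
JobId=5 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b09
JobId=6 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b10
JobId=7 SubmitTime=1543317694 StartTime=1543317804 EndTime=1543317864 NodeList=s19r2b12
"""


def parse_record(line):
    """Turn one 'Key=Value Key=Value ...' record into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def timeline(log_text):
    """Return per-job wait, start offset and run time, all in seconds.

    The start offset is measured from the earliest SubmitTime in the log.
    """
    records = [parse_record(l) for l in log_text.splitlines() if l.strip()]
    t0 = min(int(r["SubmitTime"]) for r in records)
    rows = []
    for r in records:
        submit = int(r["SubmitTime"])
        start = int(r["StartTime"])
        end = int(r["EndTime"])
        rows.append({
            "job": int(r["JobId"]),
            "node": r["NodeList"],
            "wait": start - submit,   # time spent pending
            "start": start - t0,      # offset from the first submission
            "run": end - start,       # actual run time
        })
    return rows


if __name__ == "__main__":
    for row in timeline(LOG):
        print(f"job {row['job']:>2}  {row['node']}  wait {row['wait']:>3}s  "
              f"start +{row['start']:>3}s  run {row['run']:>2}s")
```

On these eight records, jobs 2-4 start 20 s after the first submission and end at 80 s, and jobs 5-9 all start at 110 s, that is 30 s after the first job ends. That 30 s gap equals the bf_interval=30 in the slurm.conf above, though the log alone does not prove that the backfill cycle is the cause.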