Hi James,

Just for a slightly different take: 2-3 minutes seems a bit long for an epilog script. Do you need to run all of those checks after every job?

Also, you describe it as running health checks; why not run those checks via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)? Or better, split the more job-specific checks into the Epilog and put the general node-specific checks into the HealthCheckProgram. But either way, as Lyn noted, you might still need to set CompleteWait to a non-zero value to allow the epilog to finish.
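If it helps, a rough sketch of what that might look like in slurm.conf is below; the script paths are just placeholders for wherever your checks live, and the interval/wait values are only examples to tune for your site:

    # General node health checks, run out-of-band once an hour
    # (placeholder path -- point it at your own check script)
    HealthCheckProgram=/etc/slurm/healthcheck.sh
    HealthCheckInterval=3600

    # Keep only the quick, job-specific checks in the epilog
    # (placeholder path)
    Epilog=/etc/slurm/epilog.sh

    # Give the epilog time to finish before any additional jobs are
    # scheduled; pick something a little above your worst-case epilog
    # runtime (e.g. a bit more than your current 2-3 minutes)
    CompleteWait=200

Setting CompleteWait a little above the worst-case epilog runtime is the usual way to stop the scheduler handing those resources to a new job while the node is still in CG.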
Kind regards,
Paddy

On Mon, Feb 03, 2020 at 01:58:15PM +0000, Erwin, James wrote:
> Hello,
> Thank you for your reply Lyn. I found a temporary workaround (epilog
> touching a file in /tmp/ and making a prolog wait until the epilog
> finishes and removes the file).
> I was looking at CompleteWait before I tried these work-arounds but as
> it is written in the docs, I do not understand how this would help.
>
> CompleteWait
> The time, in seconds, given for a job to remain in COMPLETING state
> before any additional jobs are scheduled. If set to zero, pending jobs
> will be started as soon as possible. Since a COMPLETING job's resources
> are released for use by other jobs as soon as the Epilog completes on
> each individual node, this can result in very fragmented resource
> allocations.
>
> In my case, the epilog is still executing (according to ps and the
> health checks), and slurm still starts new jobs on the node.
>
> Thanks,
> James
>
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Lyn Gerner
> Sent: Wednesday, January 22, 2020 12:27 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Cc: slurm-us...@schedmd.com
> Subject: Re: [slurm-users] SLURM starts new job before CG finishes
>
> James, you might take a look at CompleteWait and KillWait.
>
> Regards,
> Lyn
>
> On Fri, Jan 3, 2020 at 12:27 PM Erwin, James <james.er...@intel.com> wrote:
> Hello,
>
> I’ve recently updated a cluster to SLURM 19.05.4 and noticed that new
> jobs are starting on nodes still in the CG state. In an epilog I am
> running node health checks that last about 2-3 minutes. In the previous
> version (ancient 15.08), jobs would not start running on these nodes
> until the epilog was complete and the node was out of the CG state.
> Does anyone know why this overlap of R with CG might be happening?
>
> There is a release note for version 19.05.3 that looks possibly related
> but I’m not exactly sure what it means:
>
> * Changes in Slurm 19.05.3
> ==========================
> ...
>  -- Nodes in COMPLETING state treated as being currently available for
>     job will-run test.
>
> Thanks,
> James

-- 
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/