Hi James,

Just for a slightly different take, 2-3 minutes seems a bit long for an
epilog script. Do you need to run all of those checks after every job?

Also, you describe it as running health checks; why not run those checks
via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)?

Or better, split them up: keep the job-specific checks in the Epilog and move
the general node-specific checks into the HealthCheckProgram.
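Something like this in slurm.conf, for example (the script path and the
one-hour interval are just illustrations, adjust for your site):

```
# Run general node health checks out-of-band, once an hour, on all nodes
HealthCheckProgram=/usr/local/sbin/node_health.sh
HealthCheckInterval=3600
HealthCheckNodeState=ANY
```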

But either way, as Lyn noted, you might still need to set CompleteWait to a
non-zero value to allow the epilog to finish.
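For a 2-3 minute epilog, a sketch might look like (180 is just an example
value sized to your epilog's runtime):

```
# Hold back scheduling of pending jobs while nodes are still COMPLETING,
# so the epilog has time to finish before new jobs land on the node
CompleteWait=180
```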

Kind regards,
Paddy

On Mon, Feb 03, 2020 at 01:58:15PM +0000, Erwin, James wrote:

> Hello,
> Thank you for your reply Lyn. I found a temporary workaround (epilog touching 
> a file in /tmp/ and making a prolog wait until the epilog finishes and 
> removes the file).
> I was looking at CompleteWait before I tried these work-arounds but as it is 
> written in the docs, I do not understand how this would help.
> 
> CompleteWait
> The time, in seconds, given for a job to remain in COMPLETING state before 
> any additional jobs are scheduled. If set to zero, pending jobs will be 
> started as soon as possible. Since a COMPLETING job's resources are released 
> for use by other jobs as soon as the Epilog completes on each individual 
> node, this can result in very fragmented resource allocations.
> 
> In my case, the epilog is still executing (according to ps and the health 
> checks), and Slurm still starts new jobs on the node.
> 
> Thanks,
> James
> 
> 
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Lyn 
> Gerner
> Sent: Wednesday, January 22, 2020 12:27 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Cc: slurm-us...@schedmd.com
> Subject: Re: [slurm-users] SLURM starts new job before CG finishes
> 
> James, you might take a look at CompleteWait and KillWait.
> 
> Regards,
> Lyn
> 
> On Fri, Jan 3, 2020 at 12:27 PM Erwin, James 
> <james.er...@intel.com<mailto:james.er...@intel.com>> wrote:
> Hello,
> 
> I’ve recently updated a cluster to SLURM 19.05.4 and notice that new jobs are 
> starting on nodes still in the CG state. In an epilog I am running node 
> health checks that last about 2-3 minutes. In the previous version (ancient 
> 15.08), jobs would not start running on these nodes until the epilog was 
> complete and the node is out of the CG state. Does anyone know why this 
> overlap of R with CG might be happening?
> 
> There is a release note for version 19.05.3 that looks possibly related but 
> I’m not exactly sure what it means:
> 
> * Changes in Slurm 19.05.3
> ==========================
> ...
> -- Nodes in COMPLETING state treated as being currently available for job
>     will-run test.
> 
> 
> Thanks,
> James
> 

-- 
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/
