On Monday, 7 May 2018 11:58:38 PM AEST Cory Holcomb wrote:

> Thank you, for the reply I was beginning to wonder if my message was seen.

It's a busy list at times. :-)

> While I understand how batch systems work, if you have a system daemon that
> develops a memory leak and consumes the memory outside of allocation.

Understood.

> Not checking the used memory on the box before dispatch seems like a good
> way to black hole a bunch of jobs.

This is why Slurm has support for health check scripts that can run
regularly as well as before/after a job is launched. These can knock
nodes offline. It's documented in the slurm.conf manual page.

For instance there's the LBNL Node Health Check (NHC) system that plugs
into both Slurm and Torque:

https://slurm.schedmd.com/SUG14/node_health_check.pdf
https://github.com/mej/nhc

At ${JOB-1} we would run our in-house health check from cron and write
its result to a file in /dev/shm, so that all the actual Slurm health
check script had to do was report that result to Slurm (and raise an
error if the file was missing). We did this because we used to see
health checks block on node issues, which would lock up slurmd while it
ran them. Decoupling the two fixed that.

I've put a rough sketch of both the slurm.conf side and that wrapper
script below my sig, in case it's useful.

Best of luck,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
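
The slurm.conf side looks something like the below; the parameters are
real, but the path and interval here are just examples, not a
recommendation:

  # Run the node health check script every 5 minutes on nodes in any
  # state; slurmd runs it locally on each compute node.
  HealthCheckProgram=/usr/sbin/slurm-healthcheck
  HealthCheckInterval=300
  HealthCheckNodeState=ANY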
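
And a very rough sketch of the decoupled wrapper, assuming the cron job
writes either "OK" or an error string to /dev/shm/node_health. The file
name, maximum age and drain reasons are illustrative only, not what we
actually ran:

  #!/bin/bash
  # HealthCheckProgram: just report the result of the cron-driven check.
  # Does no real checking itself, so it always returns quickly.

  STATUS_FILE=/dev/shm/node_health
  MAX_AGE=900   # seconds; treat older results as stale

  # Assumes the Slurm node name matches the short hostname.
  NODE=$(hostname -s)

  drain() {
      scontrol update NodeName="$NODE" State=DRAIN Reason="$1"
      exit 1
  }

  # A missing or stale file means the cron job isn't running properly.
  [ -f "$STATUS_FILE" ] || drain "healthcheck: no status file"

  AGE=$(( $(date +%s) - $(stat -c %Y "$STATUS_FILE") ))
  [ "$AGE" -le "$MAX_AGE" ] || drain "healthcheck: status file stale"

  RESULT=$(cat "$STATUS_FILE")
  [ "$RESULT" = "OK" ] || drain "healthcheck: $RESULT"

  exit 0

The cron job runs the real (potentially slow) checks and writes "OK" or
the first failure to that file, so even if a check wedges on a sick node
the worst case is a stale file, which the wrapper turns into a drain
rather than a hung slurmd.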
It's a busy list at times. :-) > While I understand how batch systems work, if you have a system daemon that > develops a memory leak and consumes the memory outside of allocation. Understood. > Not checking the used memory on the box before dispatch seems like a good > way to black hole a bunch of jobs. This is why Slurm has support for healthcheck scripts that can run regularly as well as before/after a job is launched. These can knock nodes offline. It's documented in the slurm.conf manual page. For instance there's the LBNL Node Health Check (NHC) system that plugs into both Slurm and Torque. https://slurm.schedmd.com/SUG14/node_health_check.pdf https://github.com/mej/nhc At ${JOB-1} we would run our in-house health check from cron and write to a file in /dev/shm so that all the actual Slurm health check script would do is send that to Slurm (and raise an error if it was missing). This was because we used to see health checks block due to issues and so slurmd would lock up running them. Decoupling them fixed that. Best of luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC