On Thursday, 10 May 2018, at 20:02:37 (+1000), Chris Samuel wrote:

> For instance there's the LBNL Node Health Check (NHC) system that plugs into
> both Slurm and Torque.
>
> https://slurm.schedmd.com/SUG14/node_health_check.pdf
> https://github.com/mej/nhc
>
> At ${JOB-1} we would run our in-house health check from cron and write to a
> file in /dev/shm, so that all the actual Slurm health check script would do
> is send that to Slurm (and raise an error if it was missing). This was
> because we used to see health checks block due to issues, and so slurmd
> would lock up running them. Decoupling them fixed that.
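The decoupling Chris describes could be sketched roughly as follows. This is a minimal illustration, not the actual in-house tooling: the status file path, the staleness window, and the placeholder check are all hypothetical.

```python
#!/usr/bin/env python3
# Sketch of the decoupled pattern: a cron job runs the (possibly slow)
# real checks and caches the result in /dev/shm; the script Slurm
# invokes only reads that cache back, so slurmd never blocks.
# All names here are illustrative placeholders.
import os
import sys
import time

STATUS_FILE = "/dev/shm/node-health-status"  # hypothetical path
MAX_AGE = 600  # seconds before a cached result counts as stale


def run_checks_and_record():
    """Run from cron: do the real checks, then cache the result."""
    result = "OK"  # placeholder for real filesystem/fabric/memory checks
    tmp = STATUS_FILE + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(result + "\n")
    # Atomic rename so the reader never sees a half-written file.
    os.replace(tmp, STATUS_FILE)


def report_cached_status():
    """Run by slurmd's health check: return instantly from the cache."""
    try:
        age = time.time() - os.path.getmtime(STATUS_FILE)
        if age > MAX_AGE:
            print("ERROR: cached health status is stale")
            return 1
        with open(STATUS_FILE) as fh:
            print(fh.read().strip())
        return 0
    except FileNotFoundError:
        print("ERROR: health status file missing")
        return 1


if __name__ == "__main__":
    if "--cron" in sys.argv:
        run_checks_and_record()
    else:
        sys.exit(report_cached_status())
```

The key property is that the path slurmd executes does nothing but a stat and a read, so a hung filesystem or interconnect check can only delay the cron side, never the scheduler.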
I'm surprised to hear that; this is the first time I've heard of that issue with Slurm. I'd only ever heard folks complain about TORQUE behaving that way.

FWIW, due to this exact situation, LBNL NHC has a built-in feature called "Detached Mode" that does something very similar, but it's all self-contained within NHC: the foreground process forks off a child and returns the results from the previous run, while the child goes off, runs all the tests, and stores its results to a file for the next health-check cycle to return. You can read more about it here:

https://github.com/mej/nhc#detached-mode

HTH,
Michael

PS: Hi Chris!  :-D

-- 
Michael E. Jennings <m...@lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605