On Wednesday, 15 August 2018, at 10:01:19 (-0400), Paul Edmon wrote:

> On 08/14/2018 05:16 AM, Pablo Llopis wrote:
> >
> >Integration with a possible built-in healthcheck is also something
> >to consider, as the orchestration logic would need to take care of
> >disabling the healthcheck functionality that automatically
> >restores/resumes drained nodes to avoid conflicts.
>
> So we use NHC for our automatic node closer. For reopening we have
> a series of scripts that we use but they are all ad hoc and not
> formalized. Same with closing off subsets of nodes we just have a
> bunch of bash scripts that we have rolled to do that.

Every site is different, and so your needs may vary. But for those
sites that use NHC, I just wanted to note how it handles the conflict
avoidance issue that was mentioned.

If a node comes back clean (all NHC tests passed), and if MARK_OFFLINE
is set to 1, then NHC will kick off a helper script called
"node-mark-online" to bring the node back into service. The version of
node-mark-online that comes with NHC will *ONLY* return nodes to
service if the SLURM "Reason" field for that node (as shown by, e.g.,
"sinfo -Rl") starts with "NHC:" (meaning that the message was put
there by NHC itself). Any nodes that are in a DRAIN or DOWN state that
were *not* drained by NHC itself will be left alone. This way, if your
Ops staff need to take a node out of service for some reason, NHC
won't try to put it back in service just because it passes all tests.

So you can safely use NHC to orchestrate operations across your
compute nodes -- relying on it to both drain unpatched nodes initially
and then restore them to service afterward -- OR you can use some
other orchestration tool like Ansible knowing that NHC will not
interfere in its activities.

As far as tool recommendations go, apart from NHC, we use a
LANL-created utility called "pexec" which can leverage netgroups,
SLURM node states (like "allup" or "alldown"), node ranges, and so on.
It's available at https://github.com/hpc/pexec

We also use pdsh and are planning to investigate clush and other
options in the near future.

HTH!
Michael

--
Michael E. Jennings <m...@lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605
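
P.S. In case it helps to see the rule spelled out, here is a minimal Python sketch of the decision described above. It is not the node-mark-online script that ships with NHC, just an illustration of the logic. The NodeStatus type, the may_return_to_service function, the node names, and the sample Reason strings are all made up, and matching the node state by substring is an assumption on my part.

```python
from typing import NamedTuple

NHC_PREFIX = "NHC:"                 # Reason prefix that marks a node as drained by NHC
OFFLINE_STATES = ("drain", "down")  # assumption: matched as lowercase substrings


class NodeStatus(NamedTuple):
    """Hypothetical record holding what a scheduler query would report."""
    name: str
    state: str
    reason: str


def may_return_to_service(node: NodeStatus, passed_all_checks: bool) -> bool:
    """Return True only for a clean, offline node that NHC itself took offline."""
    if not passed_all_checks:
        return False  # a node that fails any check stays out of service
    is_offline = any(s in node.state.lower() for s in OFFLINE_STATES)
    # Nodes drained by Ops staff or another tool carry a different Reason
    # and are deliberately left alone.
    return is_offline and node.reason.startswith(NHC_PREFIX)


# Made-up sample data for illustration only.
SAMPLE_NODES = [
    NodeStatus("cn001", "drained", "NHC: example check failure"),
    NodeStatus("cn002", "drained", "Ops: example maintenance note"),
    NodeStatus("cn003", "down", "Not responding"),
    NodeStatus("cn004", "idle", ""),
]

RESUMABLE = [n.name for n in SAMPLE_NODES if may_return_to_service(n, True)]
```

With the sample data, only the node whose Reason begins with "NHC:" ends up in RESUMABLE; the node drained by Ops stays offline even though it passes every check.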