Thank you both. For interest, this is the health check: https://github.com/amd/node-scraper/

On Mon, 18 Aug 2025 at 14:01, Bjørn-Helge Mevik via slurm-users <[email protected]> wrote:

> Ole Holm Nielsen via slurm-users <[email protected]> writes:
>
> > On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
> >> John Hearns via slurm-users wrote:
> >>
> >>> I want to run a healthcheck job on all nodes.
> >> And using HealthCheckProgram in the slurm.conf would be too easy?
> >
> > But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd
> > is started, and possibly when a new job is started.
>
> That depends on HealthCheckInterval and HealthCheckNodeState. If
> HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N
> seconds, given that the node is in one of the HealthCheckNodeState
> states (default: any state).
>
> > I think John asked for a way to run NHC on a set of nodes whenever
> > desired by the system administrator, and not at any random time,
> > right? ClusterShell is the ideal tool for making such parallel
> > commands on the cluster.
>
> Yes, for running manually, setting up the Slurm groups in clush is the
> easiest way, IMO.
>
> --
> Regards,
> Bjørn-Helge Mevik
>
> --
> slurm-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
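
For reference, the periodic behaviour Bjørn-Helge describes could be set up in slurm.conf roughly as below. This is only a sketch: the 300-second interval is an illustrative assumption, not a value anyone in the thread suggested, and /usr/sbin/nhc is simply the path Ole quoted.

```conf
# Program that slurmd runs as the node health check (path as quoted in the thread).
HealthCheckProgram=/usr/sbin/nhc
# Run the check every N seconds; 300 is an assumed example value.
# The default of 0 disables periodic execution.
HealthCheckInterval=300
# Node states in which the check runs; ANY is the default.
HealthCheckNodeState=ANY
```

With an interval greater than zero, the check runs on that schedule rather than only at slurmd startup, so an on-demand run across selected nodes would still need something like clush.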
