Hi, We also noticed this. We eventually placed the max time on the HealthCheckInterval (65535), and created a systemd.timer which runs the scripts externally of slurm, with proper intervals and randomized delays.
Yair. On Wed, Dec 2, 2020 at 9:03 AM <taleinterve...@sjtu.edu.cn> wrote: > Hello, > > > > Our slurm cluster managed about 600+ nodes and I tested to set > HealthCheckNodeState=CYCLE in slurm.conf. According to conf manual, setting > this to CYCLE shall cause slurm to “cycle through running on all compute > nodes through the course of the HealthCheckInterval”. So I set > “HealthCheckInterval = 600”, and expected the health check time point can > be evenly distributed across the 600 seconds period. > > But the test result showed that the earliest checked node is at about > 14:19:35, while the latest checked node is at about 14:20:39. A round of > the health checks only distributed across 60+ seconds? And the previous > checking round distributed from 14:08:10 to 14:09:26, it seems the > HealthCheckInterval only control the time interval between two rounds, not > the time range distributed by one round checkings. > > So did I mistake the description in conf’s manual? And is there any method > can control the health check frequency in one round between different nodes? > > > > Thanks. >