Hi,

We also noticed this. We eventually placed the max time on the
HealthCheckInterval (65535), and created a systemd.timer which runs the
scripts externally of slurm, with proper intervals and randomized delays.

    Yair.

On Wed, Dec 2, 2020 at 9:03 AM <taleinterve...@sjtu.edu.cn> wrote:

> Hello,
>
>
>
> Our slurm cluster managed about 600+ nodes and I tested to set
> HealthCheckNodeState=CYCLE in slurm.conf. According to conf manual, setting
> this to CYCLE shall cause slurm to “cycle through running on all compute
> nodes through the course of the HealthCheckInterval”. So I set
> “HealthCheckInterval = 600”, and expected the health check time point can
> be evenly distributed across the 600 seconds period.
>
> But the test result showed that the earliest checked node is at about
> 14:19:35, while the latest checked node is at about 14:20:39. A round of
> the health checks only distributed across 60+ seconds? And the previous
> checking round distributed from 14:08:10 to 14:09:26, it seems the
> HealthCheckInterval only control the time interval between two rounds, not
> the time range distributed by one round checkings.
>
> So did I mistake the description in conf’s manual? And is there any method
> can control the health check frequency in one round between different nodes?
>
>
>
> Thanks.
>

Reply via email to