Hi Brian,

>>For monitoring, I use a combination of netdata+prometheus. Data is gathered 
>>whenever the nodes are up and stored for history. Yes, when the nodes are 
>>powered down, there are empty gaps, but a gap is interpreted as the node 
>>being powered down.

Ah, a time-series system will cope much better. At the moment our monitoring 
system (for compute node health at least) is Nagios-like, hence the problem. 
There is a chance the entire cluster’s stack will change at some point, 
though, and a change of node-health monitoring system as part of that would 
make this problem easier to deal with.
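
As a rough illustration of the gap interpretation Brian describes (not his 
actual setup), the sketch below takes the timestamps at which samples arrived 
for one node and reports the intervals with no data as powered-down periods. 
The function name, the 15 second scrape interval and the tolerance factor are 
all assumptions made for the example.

```python
from typing import List, Tuple


def find_power_down_gaps(
    timestamps: List[float],
    scrape_interval: float = 15.0,  # assumed scrape period in seconds
    tolerance: float = 2.5,         # assumed slack before a gap counts
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs for gaps in the samples.

    A gap longer than scrape_interval * tolerance is treated as the node
    having been powered down, rather than as a fault.
    """
    threshold = scrape_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    # Compare each sample time with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps
```

A Nagios-style check would instead alert on every missed poll, which is why 
the up/down cycling of power saving nodes is awkward there.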

>>For the config, I have no access to DNS for configless so I use a symlink to 
>>the slurm.conf file on a shared filesystem. This works great. Anytime there 
>>are changes, a simple 'scontrol reconfigure' brings all running nodes up to 
>>speed and any down nodes will automatically read the latest.

Yes, currently we use file-based config, written to the compute nodes’ local 
disks via Ansible. Perhaps we will consider moving the file to a shared 
filesystem.
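
For anyone wanting to try the symlink approach, a minimal sketch of the 
per-node step might look like the following. It only points the local config 
path at a copy on the shared filesystem; the function name and both paths in 
the usage note are hypothetical, and it assumes the shared filesystem is 
mounted before slurmd starts.

```python
import os
from pathlib import Path


def link_shared_config(shared_conf: Path, local_conf: Path) -> None:
    """Make local_conf a symlink to shared_conf, replacing any old file."""
    if not shared_conf.is_file():
        raise FileNotFoundError(f"shared config not found: {shared_conf}")

    # Build the link under a temporary name, then rename it into place so
    # there is never a moment when local_conf is missing.
    tmp_link = local_conf.with_name(local_conf.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(shared_conf, tmp_link)
    os.replace(tmp_link, local_conf)
```

With hypothetical paths this would be called as 
link_shared_config(Path("/shared/slurm/slurm.conf"), Path("/etc/slurm/slurm.conf")). 
After editing the shared file, 'scontrol reconfigure' is what Brian uses to 
bring the running nodes up to date; nodes that are powered down read the 
current file when they next boot.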

regards
David


-------------
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Brian 
Andrus
Sent: 23 February 2022 15:27
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] monitoring and update regime for Power Saving nodes

David,

For monitoring, I use a combination of netdata+prometheus. Data is gathered 
whenever the nodes are up and stored for history. Yes, when the nodes are 
powered down, there are empty gaps, but a gap is interpreted as the node 
being powered down.

For the config, I have no access to DNS for configless so I use a symlink to 
the slurm.conf file on a shared filesystem. This works great. Anytime there 
are changes, a simple 'scontrol reconfigure' brings all running nodes up to 
speed and any down nodes will automatically read the latest.

Brian Andrus

On 2/23/2022 2:31 AM, David Simpson wrote:
Hi all,

Interested to know what the common approaches are to:

  1.  Monitoring of power saving nodes (e.g. health of the node), when 
potentially the monitoring system will see it go up and down. Do you limit to 
BMC only monitoring/health?
  2.  When you want to make changes to slurm.conf (or anything else) to a node 
which is down due to power saving (during a maintenance/reservation) what is 
your approach? Do you end up with 2 slurm.confs (one for power saving and one 
that keeps everything up, to work on during the maintenance)?

thanks
David


