Well, couldn't you either

1) salloc the lot on a maintenance day (a bit manual), or
2) make your SuspendProgram check for currently active (maintenance) reservations before shutting down a node (or check some other flag)
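
For (2), an untested sketch of that kind of SuspendProgram wrapper - the power-off helper path at the end is just a placeholder for whatever your site uses:

    #!/bin/bash
    # SuspendProgram wrapper: refuse to power nodes down while a maintenance
    # reservation is active. Slurm passes the node list to suspend as $1.
    NODES="$1"

    # Any ACTIVE reservation carrying the MAINT flag blocks suspension.
    if scontrol -o show reservation 2>/dev/null \
           | grep 'Flags=[^ ]*MAINT' | grep -q 'State=ACTIVE'; then
        logger -t slurm-suspend "maintenance reservation active, not suspending $NODES"
        exit 0
    fi

    # Otherwise hand over to the usual power-off mechanism;
    # /usr/local/sbin/node-poweroff.sh is a placeholder for the site's script.
    exec /usr/local/sbin/node-poweroff.sh "$NODES"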

Also, if you mod slurm.conf in preparation for maintenance days & just exclude all nodes from suspending, they ought to pick the change up before they power down again once prodded up, or not?
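
(Something along these lines in slurm.conf for the window, picked up with an 'scontrol reconfigure'; the node range is only an example:)

    # temporary additions to slurm.conf for the maintenance window
    SuspendExcNodes=node[001-128]   # exclude everything from suspend
    # or disable power saving entirely for the window:
    # SuspendTime=-1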


Tina

On 24/02/2022 13:03, David Simpson wrote:
Hi Tina,

Thanks - it's not so much the config being fully up to date on the compute nodes; it's 
more that, when we transition into the system-wide maintenance day reservation, I 
anticipate some of the compute nodes will be down due to power saving (I'm 
expecting the reservation not to affect that: no job will be trying to start on 
the power-saving node(s), so they will stay powered down but reserved). They 
then need a prod to come up, which could be achieved by a short-running job, 
but after that, even though they are reserved, they might try to power off again 
once their idle threshold is met. So I guess you then either need to clustershell 
off (or similar) the slurmd, or roll out a temporary (maintenance-period-only) 
slurm.conf under which the idle-time threshold is never hit.
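
(For reference, power-saved nodes show up in sinfo with a "~" suffix on the state, so something like this shows what is currently powered down; the reservation name is just an example:)

    sinfo -N -o "%N %T"                   # e.g. "node012 idle~" = powered down
    scontrol show reservation maint_2022  # which nodes the reservation covers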

(I probably didn't explain the slurm config bit very well)

I am perhaps overthinking it; a dummy job to bring the powered-down nodes up, 
followed by a clustershell slurmd stop, is probably the answer.
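
Roughly, with placeholder node and reservation names:

    # wake the reserved nodes - either a trivial job against the reservation...
    srun --reservation=maint_2022 -N10 /bin/true
    # ...or an explicit power-up request
    scontrol update NodeName=node[001-010] State=POWER_UP

    # then keep them from suspending again by stopping slurmd for the window
    clush -w node[001-010] systemctl stop slurmd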

regards
David

-------------
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Tina 
Friedrich
Sent: 24 February 2022 09:43
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] monitoring and update regime for Power Saving nodes

Hi David,

it's also not actually a problem if the slurm.conf is not exactly the same 
immediately on boot - really. Unless there are changes that are very fundamental, 
nothing bad will happen if they pick up a new copy after, say, 5 or 10 minutes.

But it should be possible to - for example - force a run of your config 
management on startup (or before SLURM startup)?
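
For instance, a sketch of a systemd drop-in that runs config management just before slurmd starts - this assumes an ansible-pull style setup, and the repository URL is made up:

    # /etc/systemd/system/slurmd.service.d/override.conf
    [Service]
    ExecStartPre=/usr/bin/ansible-pull -U https://git.example.org/site/slurm-config.git local.yml

(then a 'systemctl daemon-reload' on the node)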

(Not many ideas about the Nagios check, unless you change it to something that 
interrogates SLURM about node states, or keep some other record somewhere that 
it can interrogate about nodes meant to be down.)
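
An untested sketch of such a check, treating a deliberately power-saved node (state with a "~" suffix from sinfo) as OK rather than as a failure:

    #!/bin/bash
    # Nagios-style check: a node Slurm has powered down on purpose is OK.
    # Takes the node name as its only argument.
    NODE="$1"
    STATE=$(sinfo -h -N -n "$NODE" -o "%T" | head -n1)

    case "$STATE" in
        *~)  echo "OK - $NODE powered down by Slurm ($STATE)"; exit 0 ;;
        idle|mixed|allocated|completing)
             echo "OK - $NODE up ($STATE)"; exit 0 ;;
        *)   echo "CRITICAL - $NODE state: $STATE"; exit 2 ;;
    esac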

Tina

On 24/02/2022 09:20, David Simpson wrote:
Hi Brian,

  >>For monitoring, I use a combination of netdata+prometheus. Data is
gathered whenever the nodes are up and stored for history. Yes, when
the nodes are powered down, there are empty gaps, but that is
interpreted as the node being powered down.

Ah, a time-series system will cope much better - at the moment our monitoring
system (for compute node health at least) is Nagios-like, hence the
problem. Though there's potential the entire cluster's stack may
change at some point, so this problem will be easier to deal with
(with a change of monitoring system for node health).

  >>For the config, I have no access to DNS for configless so I use a
symlink to the slurm.conf file on a shared filesystem. This works great.
Anytime there are changes, a simple 'scontrol reconfigure' brings all
running nodes up to speed and any down nodes will automatically read
the latest.

Yes, currently we use file-based config, written to the compute
nodes' own disks via Ansible. Perhaps we will consider moving
the file to a shared fs.

regards
David

-------------
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Brian Andrus
Sent: 23 February 2022 15:27
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] monitoring and update regime for Power Saving nodes

David,

For monitoring, I use a combination of netdata+prometheus. Data is
gathered whenever the nodes are up and stored for history. Yes, when
the nodes are powered down, there are empty gaps, but that is
interpreted as the node being powered down.

For the config, I have no access to DNS for configless so I use a
symlink to the slurm.conf file on a shared filesystem. This works great.
Anytime there are changes, a simple 'scontrol reconfigure' brings all
running nodes up to speed and any down nodes will automatically read
the latest.
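
Roughly, with example paths:

    # point the local config at the shared copy on every node
    ln -s /shared/slurm/slurm.conf /etc/slurm/slurm.conf

    # after editing the shared copy, push the change to running slurmds
    scontrol reconfigure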

Brian Andrus

On 2/23/2022 2:31 AM, David Simpson wrote:

     Hi all,

     Interested to know what the common approaches are to:


      1. Monitoring of power-saving nodes (e.g. health of the node), when
         potentially the monitoring system will see them go up and down. Do
         you limit yourself to BMC-only monitoring/health?
      2. When you want to make changes to slurm.conf (or anything else)
         on a node which is down due to power saving (during a
         maintenance/reservation), what is your approach? Do you end up
         with 2 slurm.confs (one for power saving and one that keeps
         everything up, to work on during the maintenance)?


     thanks
     David


     -------------
     David Simpson - Senior Systems Engineer
     ARCCA, Redwood Building,
     King Edward VII Avenue,
     Cardiff, CF10 3NB

     David Simpson - peiriannydd uwch systemau
     ARCCA, Adeilad Redwood,
     King Edward VII Avenue,
     Caerdydd, CF10 3NB

     simpso...@cardiff.ac.uk
     +44 29208 74657


--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk



--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
