Re: [slurm-users] [External] Hibernating a whole cluster

Ole Holm Nielsen Mon, 06 Feb 2023 12:26:38 -0800

I would agree with Florian about using the Slurm power_save method.

In the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving there are additional details and scripts for performing node suspend and resume.

You would need the server to have a BMC so that you can power it down and up using IPMI commands from your Slurm management server.
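
[Editor's sketch, not part of the original message.] As an illustration of the IPMI step, the following Python snippet builds an ipmitool command line for a chassis power action. The BMC host name and user are hypothetical placeholders, and the lanplus interface is an assumption about the BMC. The password is read by ipmitool from the IPMI_PASSWORD environment variable (the -E option).

```python
import subprocess


def ipmi_power_command(bmc_host, user, action):
    """Build an ipmitool command line for a chassis power action."""
    if action not in ("on", "off", "soft", "status"):
        raise ValueError(f"unsupported power action: {action}")
    # -E tells ipmitool to take the password from the IPMI_PASSWORD
    # environment variable, so it is not passed on the command line.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", action]


def set_power(bmc_host, user, action):
    """Run the command; raises CalledProcessError if ipmitool fails."""
    result = subprocess.run(ipmi_power_command(bmc_host, user, action),
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical BMC address and user name; replace with your own.
    # This only prints the command, it does not contact any BMC.
    print(" ".join(ipmi_power_command("node01-bmc.example.org", "admin", "status")))
```

Calling set_power(...) with "off" or "on" from the Slurm management server would then power the node down or up, assuming the BMC is reachable from there.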


/Ole


On 06-02-2023 21:07, Florian Zillner wrote:

follow this guide: https://slurm.schedmd.com/power_save.html
Create poweroff / poweron scripts and configure Slurm to do the poweroff after X minutes. Works well for us. Make sure to set an appropriate time (ResumeTimeout) to allow the node to come back to service. Note that we did not achieve good power saving with suspending the nodes; powering them off and on saves way more power. The downside is that it takes ~5 minutes to resume (= power on) the nodes when needed.
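
[Editor's sketch, not part of the original message.] In slurm.conf, the poweroff and poweron scripts are registered as SuspendProgram and ResumeProgram, and the "X minutes" of idle time corresponds to SuspendTime (given in seconds). Slurm passes the affected nodes to these programs as a single hostlist expression, which "scontrol show hostnames" can expand. A minimal Python skeleton for such a program might look like this; power_action is a hypothetical callable standing in for whatever actually powers a node off or on.

```python
import subprocess
import sys


def expand_hostlist(hostlist, run=subprocess.run):
    """Expand a Slurm hostlist expression such as 'node[01-03]' into names."""
    result = run(["scontrol", "show", "hostnames", hostlist],
                 check=True, capture_output=True, text=True)
    return result.stdout.split()


def main(argv, power_action, expand=expand_hostlist):
    """Entry point for a SuspendProgram or ResumeProgram.

    slurmctld passes the nodes to act on as one hostlist argument.
    power_action is a hypothetical callable taking one node name.
    """
    if len(argv) != 2:
        print(f"usage: {argv[0]} <hostlist>", file=sys.stderr)
        return 2
    for node in expand(argv[1]):
        power_action(node)
    return 0


# Example wiring for a poweroff script (dry run that only prints):
#   sys.exit(main(sys.argv, lambda node: print("would power off", node)))
```

The run and expand parameters exist only so the skeleton can be exercised without a live Slurm installation.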
Cheers,
Florian
------------------------------------------------------------------------
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha Roy <hariseldo...@gmail.com>
*Sent:* Monday, 6 February 2023 18:21
*To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
*Subject:* [External] [slurm-users] Hibernating a whole cluster
Hi,
I've just finished setup of a single node "cluster" with Slurm on Ubuntu 20.04. Infrastructural limitations prevent me from running it 24/7, and it's only powered on during business hours.
Currently, I have a cron job running that hibernates that sole node before closing time.
The hibernation is done with standard systemd, and hibernates to the swap partition.
I have not run any lengthy Slurm jobs on it yet. Before I do, can I get some thoughts on a couple of things?
If it hibernated when Slurm still had jobs running/queued, would they resume properly when the machine powers back on?
Note that my swap space is bigger than my RAM.
Is it necessary to perhaps set up a pre-hibernate script for systemd to iterate scontrol to suspend all the jobs before hibernating and resume them post-resume?
What about the wall times? I'm guessing that Slurm will count the downtime as elapsed for each job. Is there a way to configure this, or is the only alternative a post-hibernate script that iteratively updates the wall times of the running jobs using scontrol again?
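
[Editor's sketch, not part of the original message.] The pre-hibernate idea in the question could be prototyped as below. This is untested, and the replies in this thread recommend Slurm's power saving instead, without confirming whether jobs survive hibernation. systemd runs executables placed in its system-sleep directory with "pre" or "post" as the first argument, so a hook could call hook(sys.argv[1]). The function names here are hypothetical.

```python
import subprocess


def job_ids(state, run=subprocess.run):
    """Return the IDs of jobs in the given state, as reported by squeue."""
    result = run(["squeue", "--noheader", "--states", state, "--format", "%i"],
                 check=True, capture_output=True, text=True)
    return result.stdout.split()


def hook(phase, run=subprocess.run):
    """Suspend running jobs before sleep, resume suspended jobs after.

    Caveat: the "post" phase resumes every suspended job, including
    any that were suspended for unrelated reasons.
    """
    if phase == "pre":
        state, verb = "RUNNING", "suspend"
    elif phase == "post":
        state, verb = "SUSPENDED", "resume"
    else:
        raise ValueError(f"unknown phase: {phase}")
    ids = job_ids(state, run)
    for job in ids:
        run(["scontrol", verb, job], check=True)
    return ids
```

This does not address the wall-time question; it only covers suspending and resuming the jobs themselves.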
