That's probably not optimal, but it could work. I'd go with brutal
preemption: swapping out 90+ GB can be quite time-consuming.
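
If by "brutal preemption" we mean just kicking the running jobs back into
the queue before the downtime, so their memory is freed instead of swapped,
a one-liner along these lines would do it (untested, and it assumes the
jobs are requeueable, i.e. batch jobs submitted with --requeue or
JobRequeue=1 in slurm.conf):

   # Requeue every running job; it restarts from scratch after the downtime.
   squeue --noheader --states=RUNNING --format="scontrol requeue %i" | sh
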
Diego
On 07/02/2023 14:18, Analabha Roy wrote:
On Tue, 7 Feb 2023, 18:12 Diego Zuccato, <diego.zucc...@unibo.it> wrote:
RAM used by a suspended job is not released. At most it can be swapped
out (if enough swap is available).

There should be enough swap available. I have 93 GB of RAM and an
equally large swap partition. I can top it off with swap files if needed.
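
Topping it off with a swap file would just be the usual dance, something
like this, run as root (the size is arbitrary):

   # Create and enable an extra 32 GB swap file next to the swap partition.
   fallocate -l 32G /swapfile
   chmod 600 /swapfile
   mkswap /swapfile
   swapon /swapfile
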
On 07/02/2023 13:14, Analabha Roy wrote:
> Hi Sean,
>
> Thanks for your awesome suggestion! I'm going through the
> reservation docs now. At first glance, it seems like a daily
> reservation would turn down jobs that are too big for the
> reservation. It'd be nice if Slurm could suspend (in the manner of
> 'scontrol suspend') jobs during reserved downtime and resume them
> after. That way, folks can submit large jobs without having to
> worry about the downtimes. Perhaps the FLEX option in reservations
> can accomplish this somehow?
>
>
> I suppose that I can do it using a shell script iterator and a
> cron job, but that seems like an ugly hack. I was hoping there is
> a way to configure this in Slurm itself?
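>
> For concreteness, the iterator I have in mind would be something like
> this (untested; the script name is made up, and it would have to run
> as root or the SlurmUser, once from cron before the hibernate and once
> after power-on):
>
>     #!/bin/bash
>     # slurm-jobs.sh -- pause or resume all jobs around the downtime.
>     case "$1" in
>       suspend)
>         # Pause every running job so it isn't killed by the hibernate.
>         for jobid in $(squeue --noheader --states=RUNNING --format=%i); do
>           scontrol suspend "$jobid"
>         done
>         ;;
>       resume)
>         # Let the paused jobs continue once the node is back up.
>         for jobid in $(squeue --noheader --states=SUSPENDED --format=%i); do
>           scontrol resume "$jobid"
>         done
>         ;;
>       *)
>         echo "usage: $0 suspend|resume" >&2
>         exit 1
>         ;;
>     esac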
>
> AR
>
> On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcg...@tcd.ie> wrote:
>
> Hi Analabha,
>
> Could you do something like create a daily reservation for 8
> hours that starts at 9am, or whatever times work for you, like
> the following untested command:
>
> scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 flags=daily ReservationName=daily
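>
> Jobs would then be submitted into it with something like the line
> below (also untested; note that scontrol will probably also want a
> users= or accounts= list when the reservation is created):
>
>     sbatch --reservation=daily jobscript.sh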
>
> Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
>
> Some more possibly helpful documentation at
> https://slurm.schedmd.com/reservations.html, search for "daily".
>
> My idea being that jobs can only run in that reservation (that
> would have to be configured separately, not sure how off the top
> of my head), which is only active during the times you want the
> node to be working. So the cronjob that hibernates/shuts it down
> will do so when there are no jobs running. At least in theory.
>
> Hope that helps.
>
> Sean
>
> ---
> Sean McGrath
> Senior Systems Administrator, IT Services
>
>
------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha Roy <hariseldo...@gmail.com>
> *Sent:* Tuesday 7 February 2023 10:05
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
> Hi,
>
> Thanks. I had read the Slurm Power Saving Guide before. I believe
> the configs enable slurmctld to check other nodes for idleness and
> suspend/resume them. Slurmctld must run on a separate, always-on
> server for this to work, right?
>
> My issue might be a little different. I literally have only one
> node that runs everything: slurmctld, slurmd, slurmdbd, everything.
>
> This node must be set to "sudo systemctl hibernate" after business
> hours, regardless of whether jobs are queued or running. The next
> business day, it can be switched on manually.
>
> systemctl hibernate is supposed to save the entire run state of
> the sole node to swap and power off. When powered on again, it
> should restore everything to its previous running state.
>
> When the job queue is empty, this works well. I'm not sure how
> well this hibernate/resume will work with running jobs and would
> appreciate any suggestions or insights.
>
> AR
>
>
> On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzill...@lenovo.com> wrote:
>
> Hi,
>
> follow this guide: https://slurm.schedmd.com/power_save.html
>
> Create poweroff / poweron scripts and configure Slurm to do the
> poweroff after X minutes. Works well for us. Make sure to set an
> appropriate time (ResumeTimeout) to allow the node to come back
> into service.
> Note that we did not achieve good power savings by suspending
> the nodes; powering them off and on saves way more power. The
> downside is that it takes ~5 mins to resume (= power on) the
> nodes when needed.
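>
> For reference, the relevant slurm.conf settings look roughly like
> the sketch below; the script paths and timings are placeholders,
> not our exact config:
>
>     # Power a node off after 10 minutes of idleness; allow up to
>     # 10 minutes for it to boot and rejoin before it is marked down.
>     SuspendProgram=/etc/slurm/node_poweroff.sh
>     ResumeProgram=/etc/slurm/node_poweron.sh
>     SuspendTime=600
>     SuspendTimeout=120
>     ResumeTimeout=600
>
> Both scripts receive the affected node list as their argument and
> just have to trigger the actual power-off / power-on (IPMI,
> systemctl poweroff, wake-on-LAN, ...).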
>
> Cheers,
> Florian
>
------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Analabha Roy <hariseldo...@gmail.com>
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
> Hi,
>
> I've just finished setting up a single-node "cluster" with Slurm
> on Ubuntu 20.04. Infrastructural limitations prevent me from
> running it 24/7, and it's only powered on during business hours.
>
>
> Currently, I have a cron job running that hibernates that sole
> node before closing time.
>
> The hibernation is done with standard systemd and hibernates to
> the swap partition.
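>
> Concretely, such a cron job can be as simple as a root crontab
> entry like this one (the schedule is just an example):
>
>     # Hibernate the node every weekday evening at 18:30.
>     30 18 * * 1-5 /usr/bin/systemctl hibernate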
>
> I have not run any lengthy Slurm jobs on it yet. Before I do,
> can I get some thoughts on a couple of things?
>
> If it hibernated when Slurm still had jobs running/queued, would
> they resume properly when the machine powers back on?
>
> Note that my swap space is bigger than my RAM.
>
> Is it necessary to perhaps set up a pre-hibernate script for
> systemd that iterates scontrol to suspend all the jobs before
> hibernating and resumes them post-resume?
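>
> The mechanism I'm thinking of is a systemd system-sleep hook;
> systemd runs everything in /lib/systemd/system-sleep/ with
> "pre"/"post" and the sleep type as arguments, so a sketch
> (untested) would be:
>
>     #!/bin/bash
>     # /lib/systemd/system-sleep/90-slurm-jobs
>     # $1 is "pre" before sleeping and "post" after waking up;
>     # $2 is the action ("hibernate", "suspend", ...).
>     case "$1" in
>       pre)  squeue --noheader --states=RUNNING   --format="scontrol suspend %i" | sh ;;
>       post) squeue --noheader --states=SUSPENDED --format="scontrol resume %i"  | sh ;;
>     esac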
>
> What about the wall times? I'm guessing that Slurm will count the
> downtime as elapsed time for each job. Is there a way to configure
> this, or is the only alternative a post-hibernate script that
> iteratively updates the wall times of the running jobs using
> scontrol again?
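>
> The per-job update would presumably be something like the command
> below (untested; scontrol accepts a TimeLimit prefixed with + or -
> to adjust the current limit):
>
>     # Give job 1234 two extra hours to make up for the downtime.
>     scontrol update jobid=1234 timelimit=+02:00:00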
>
> Thanks for your attention.
> Regards
> AR
>
>
>
> --
> Analabha Roy
> Assistant Professor
> Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
> The University of Burdwan <http://www.buruniv.ac.in/>
> Golapbag Campus, Barddhaman 713104
> West Bengal, India
> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
> Webpage: http://www.ph.utexas.edu/~daneel/
>
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786