On Tue, 7 Feb 2023, 18:12 Diego Zuccato, <diego.zucc...@unibo.it> wrote:
> RAM used by a suspended job is not released. At most it can be swapped
> out (if enough swap is available).
>

There should be enough swap available. I have 93 GB of RAM and an equally
large swap partition. I can top it off with swap files if needed.

> On 07/02/2023 13:14, Analabha Roy wrote:
> > Hi Sean,
> >
> > Thanks for your awesome suggestion! I'm going through the reservation
> > docs now. At first glance, it seems like a daily reservation would turn
> > down jobs that are too big for the reservation. It'd be nice if slurm
> > could suspend (in the manner of 'scontrol suspend') jobs during reserved
> > downtime and resume them after. That way, folks can submit large jobs
> > without having to worry about the downtimes. Perhaps the FLEX option in
> > reservations can accomplish this somehow?
> >
> > I suppose that I can do it using a shell script iterator and a cron job,
> > but that seems like an ugly hack. I was hoping there is a way to
> > configure this in slurm itself?
> >
> > AR
> >
> > On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcg...@tcd.ie> wrote:
> >
> > Hi Analabha,
> >
> > Could you do something like create a daily reservation for 8 hours
> > that starts at 9am, or whatever times work for you, with the
> > following untested command:
> >
> > scontrol create reservation starttime=09:00:00 duration=8:00:00
> > nodecnt=1 flags=daily ReservationName=daily
> >
> > Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
> >
> > Some more possibly helpful documentation at
> > https://slurm.schedmd.com/reservations.html; search for "daily".
> >
> > My idea being that jobs can only run in that reservation (that would
> > have to be configured separately, not sure how off the top of my
> > head), which is only active during the times you want the node to be
> > working. So the cron job that hibernates/shuts it down will do so
> > when there are no jobs running. At least in theory.
> >
> > Hope that helps.
> >
> > Sean
> >
> > ---
> > Sean McGrath
> > Senior Systems Administrator, IT Services
> >
> > ------------------------------------------------------------------------
> > *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
> > of Analabha Roy <hariseldo...@gmail.com>
> > *Sent:* Tuesday 7 February 2023 10:05
> > *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
> >
> > Hi,
> >
> > Thanks. I had read the Slurm Power Saving Guide before. I believe the
> > configs enable slurmctld to check other nodes for idleness and
> > suspend/resume them. Slurmctld must run on a separate, always-on
> > server for this to work, right?
> >
> > My issue might be a little different. I literally have only one node
> > that runs everything: slurmctld, slurmd, slurmdbd, everything.
> >
> > This node must be set to "sudo systemctl hibernate" after business
> > hours, regardless of whether jobs are queued or running. The next
> > business day, it can be switched on manually.
> >
> > systemctl hibernate is supposed to save the entire run state of the
> > sole node to swap and power off. When powered on again, it should
> > restore everything to its previous running state.
> >
> > When the job queue is empty, this works well.
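
(For concreteness, the "shell script iterator" I mentioned above would
probably be little more than the rough, untested sketch below, run from
cron just before the downtime and again after power-on. It would need to
run as root or the SlurmUser to be allowed to suspend other users' jobs:

  # suspend every running job before the downtime (untested sketch)
  for jobid in $(squeue --noheader --states=RUNNING --format=%i); do
      scontrol suspend "$jobid"
  done

  # ...and resume the suspended jobs once the node is back up
  for jobid in $(squeue --noheader --states=SUSPENDED --format=%i); do
      scontrol resume "$jobid"
  done

Whether anything like that is actually needed around systemctl hibernate
is exactly the question I raise below.)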
> > I'm not sure how well this hibernate/resume will work with running
> > jobs and would appreciate any suggestions or insights.
> >
> > AR
> >
> > On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzill...@lenovo.com> wrote:
> >
> > Hi,
> >
> > Follow this guide: https://slurm.schedmd.com/power_save.html
> >
> > Create poweroff / poweron scripts and configure slurm to do the
> > poweroff after X minutes. Works well for us. Make sure to set an
> > appropriate time (ResumeTimeout) to allow the node to come back to
> > service.
> > Note that we did not achieve good power saving with suspending the
> > nodes; powering them off and on saves far more power. The downside is
> > that it takes ~5 minutes to resume (= power on) the nodes when needed.
> >
> > Cheers,
> > Florian
> >
> > ------------------------------------------------------------------------
> > *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
> > of Analabha Roy <hariseldo...@gmail.com>
> > *Sent:* Monday, 6 February 2023 18:21
> > *To:* slurm-users@lists.schedmd.com
> > *Subject:* [External] [slurm-users] Hibernating a whole cluster
> >
> > Hi,
> >
> > I've just finished setting up a single-node "cluster" with slurm on
> > Ubuntu 20.04. Infrastructural limitations prevent me from running it
> > 24/7, and it's only powered on during business hours.
> >
> > Currently, I have a cron job running that hibernates that sole node
> > before closing time.
> >
> > The hibernation is done with standard systemd, and hibernates to the
> > swap partition.
> >
> > I have not run any lengthy slurm jobs on it yet. Before I do, can I
> > get some thoughts on a couple of things?
> >
> > If it hibernated when slurm still had jobs running/queued, would they
> > resume properly when the machine powers back on?
> >
> > Note that my swap space is bigger than my RAM.
> >
> > Is it necessary to perhaps set up a pre-hibernate script for systemd
> > that iterates scontrol to suspend all the jobs before hibernating and
> > resumes them post-resume?
> >
> > What about the wall times? I'm guessing that slurm will count the
> > downtime as elapsed time for each job. Is there a way to configure
> > this, or is the only alternative a post-hibernate script that
> > iteratively updates the wall times of the running jobs using scontrol
> > again?
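
(On the wall-time point: if it turns out that slurm does charge the
downtime to the jobs, the post-hibernate bookkeeping could presumably be
a loop along the lines of this rough, untested sketch. The 16-hour
increment is just a made-up figure for an overnight outage:

  # give each affected job back the (assumed) 16 hours lost overnight
  for jobid in $(squeue --noheader --states=RUNNING,SUSPENDED --format=%i); do
      scontrol update JobId="$jobid" TimeLimit=+16:00:00
  done

scontrol accepts a leading "+" on TimeLimit as an increment to the
current limit, though increasing a job's limit normally requires
administrator privileges.)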
> > Thanks for your attention.
> >
> > Regards
> > AR
> >
> > --
> > Analabha Roy
> > Assistant Professor
> > Department of Physics
> > <http://www.buruniv.ac.in/academics/department/physics>
> > The University of Burdwan <http://www.buruniv.ac.in/>
> > Golapbag Campus, Barddhaman 713104
> > West Bengal, India
> > Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
> > Webpage: http://www.ph.utexas.edu/~daneel/
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786