One problem with suspend/sleep arises if you have services that depend on persistent TCP connections. I don't know that GPFS (er, sorry, "Spectrum Scale"), for instance, would be consistently tolerant of having its daemon connections interrupted, even if the node in question wasn't actually doing any I/O.
Years ago we tried engineering custom "green cluster" automation with Grid Engine, shutting down idle nodes until they were needed, but doing it independently of the resource manager was far too complicated for us to maintain. It was also all cost and no benefit for us, since our power and cooling charges are absorbed through a flat overhead rate. This might be less of an issue for schedulers/resource managers with fewer requestable resources than GE, and for sites that are billed for the power/cooling they actually use and can more easily justify the staff time to manage the extra complexity.

On Sat, Jul 17, 2021 at 12:43:27AM +0100, Jörg Saßmannshausen wrote:
> Hi Doug,
>
> Interesting topic, and quite apt when I look at the flooding in Germany,
> Belgium and the Netherlands.
>
> I guess there are a number of reasons why people are not doing it. Discarding
> the usual "we have never done that", I guess the main problem is: when do you
> want to turn a node off? After 5 minutes of being idle? Maybe 10? An hour? How
> often do you then need to boot nodes up again, and how much energy does that
> cost? From chatting to a few people who tried it in the past, it seems
> you do not save as much energy as you were hoping for.
>
> However, one thing came to my mind: is it possible to simply suspend a node
> to disk and then let it sleep? That way you wake the node up more quickly and
> probably need less power while it is suspended. Think of laptops.
>
> The other option would simply be: we know that in, say, the summer there is
> less demand, so we turn off X nodes and perhaps do some maintenance
> on them. You then run the whole cluster for, say, six weeks at limited
> capacity. That might mean a few jobs are queuing, but it also gives us a
> window to do things. Once people come back, the maintenance is done and
> the cluster can run at full capacity again.
>
> Just some (crazy?) ideas.
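For comparison, doing this through the resource manager itself is largely a configuration exercise today: Slurm's built-in power saving calls site-supplied suspend/resume scripts once a node has sat idle for a configurable time. A minimal slurm.conf sketch (the script paths and timing values below are illustrative assumptions, not recommendations):

```ini
# slurm.conf power-saving sketch -- paths and values are illustrative
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site script that powers nodes off (e.g. via IPMI)
ResumeProgram=/usr/local/sbin/node_resume.sh     # site script that powers nodes back on
SuspendTime=600                # seconds a node must be idle before it is suspended
SuspendTimeout=120             # seconds allowed for a node to power down
ResumeTimeout=600              # seconds allowed for a node to boot and rejoin
SuspendRate=10                 # max nodes suspended per minute (limits thermal cycling in bulk)
ResumeRate=10                  # max nodes resumed per minute (limits inrush power surge)
SuspendExcNodes=login[01-02]   # nodes that must never be powered down
```

The suspend/resume scripts receive the affected node names from slurmctld and typically wrap IPMI or vendor power control; the rate limits are one knob for the power-surge and temperature-cycling concerns raised in this thread.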
>
> All the best
>
> Jörg
>
> On Friday, 16 July 2021 at 20:35:11 BST, Douglas Eadline wrote:
> > Hi everyone:
> >
> > Reducing power use has become an important topic. One
> > of the questions I have always wondered about is
> > why more clusters do not turn off unused nodes. Slurm
> > has hooks to turn nodes off when not in use and
> > turn them back on when resources are needed.
> >
> > My understanding is that power cycling creates
> > temperature cycling, which then leads to premature node
> > failure. That makes sense, but has anyone ever studied/tested
> > this?
> >
> > The only other reasons I can think of are that the delay
> > in server boot time makes job starts slow, or power
> > surge issues.
> >
> > I'm curious about other ideas or experiences.
> >
> > Thanks
> >
> > --
> > Doug

--
Skylar

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf