Doug,
I don't think thermal cycling is as big of an issue as it used to be.
From what I've always been told/read, the biggest problem with thermal
cycling was "chip creep", where the expansion/contraction of a chip in a
socket would cause the chip to eventually work itself loose enough to
cause faulty connections. 20+ years ago, I remember looking at
motherboards with chips inserted into sockets. On a modern motherboard,
just about everything is soldered to the motherboard, except the CPU and
DIMMs. The CPUs are usually locked securely into place so chip creep
won't happen, and the DIMMs have a latching mechanism, although anyone
who has ever reseated a DIMM to fix a problem knows that mechanism
isn't perfect.
As someone else has pointed out, components with moving parts, like
spinning disks, are at higher risk of failure. Here, too, that risk is
disappearing, as SSDs are becoming more common, with even NVMe drives
available in servers.
I know there is a direct relationship between system failure and
operating temperature, but I don't know if that applies to all
components, or just those with moving parts. Someone somewhere must
have done research on this. I know Google did research on hard drive
failure that was pretty popular. I would imagine they would have
researched this, too.
As an example, when I managed an IBM Blue Gene/P, I remember IBM touting
that all the components on a node (which was only the size of a PCI
card) were soldered to the board - nothing was plugged into a socket.
This was to completely eliminate chip creep and increase reliability.
Also, the BG/P would shut down nodes between jobs, just as you're asking
about here. If there was another job waiting in the queue for those
nodes, the nodes would at least reboot between every job.
I do have to say that even though my BG/P was small for a Blue Gene, it
still had 2048 nodes, and given that number of nodes, I had extremely
few hardware problems at the node-level, so there's something to be said
for that logic. I did, however, have to occasionally reseat a node into
a node card, which is the same as reseating a DIMM or a PCI card in a
regular server.
Prentice
On 7/16/21 3:35 PM, Douglas Eadline wrote:
Hi everyone:
Reducing power use has become an important topic. One
of the questions I always wondered about is
why more clusters do not turn off unused nodes. Slurm
has hooks to turn nodes off when not in use and
turn them on when resources are needed.
My understanding is that power cycling creates
temperature cycling, which then leads to premature node
failure. That makes sense, but has anyone ever studied/tested
this?
The only other reasons I can think of are that the delay in
server boot time makes job starts slow, or possible power
surge issues.
I'm curious about other ideas or experiences.
Thanks
--
Doug
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf