Doug,

I don't think thermal cycling is as big of an issue as it used to be. From what I've always been told/read, the biggest problem with thermal cycling was "chip creep", where the repeated expansion and contraction of a chip in a socket would eventually work the chip loose enough to cause faulty connections. 20+ years ago, I remember looking at motherboards with chips inserted into sockets. On a modern motherboard, just about everything is soldered to the board, except the CPU and DIMMs. The CPUs are usually locked securely into place, so chip creep won't happen there, and the DIMMs have a latching mechanism, although anyone who has ever reseated a DIMM to fix a problem knows that mechanism isn't perfect.

As someone else has pointed out, components with moving parts, like spinning disks, are at higher risk of failure. Here, too, that risk is disappearing as SSDs become more common, with even NVMe drives available in servers.

I know there is a direct relationship between system failure and operating temperature, but I don't know if that applies to all components or just those with moving parts. Someone, somewhere, must have done research on this. I know Google published research on hard drive failure that got a lot of attention; I would imagine they have looked into this, too.
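
(For what it's worth, the usual rule of thumb for steady-state temperature is an Arrhenius-type acceleration factor, something like AF = exp[(Ea/k)(1/T_low - 1/T_high)] with temperatures in kelvin and Ea an activation energy that depends on the failure mechanism, while fatigue from temperature *cycling* of solder joints is usually modeled separately with a Coffin-Manson relation, where the number of cycles to failure drops as a power of the temperature swing. I'm citing those from memory as general reliability models, not anything specific to servers, and which effect actually dominates for whole nodes is exactly what that kind of study would have to sort out.)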

As an example, when I managed an IBM Blue Gene/P, I remember IBM touting that all the components on a node (which was only the size of a PCI card) were soldered to the board - nothing was plugged into a socket. This was done to completely eliminate chip creep and increase reliability. Also, the BG/P would shut down nodes between jobs, just as you're asking about here. If there was another job waiting in the queue for those nodes, the nodes would at least reboot between every job.

I do have to say that even though my BG/P was small for a Blue Gene, it still had 2048 nodes, and given that number of nodes, I had extremely few hardware problems at the node level, so there's something to be said for that logic. I did, however, occasionally have to reseat a node into a node card, which is essentially the same as reseating a DIMM or a PCI card in a regular server.

Prentice


On 7/16/21 3:35 PM, Douglas Eadline wrote:
Hi everyone:

Reducing power use has become an important topic. One
of the questions I have always wondered about is
why more clusters do not turn off unused nodes. Slurm
has hooks to turn nodes off when not in use and
turn them on when resources are needed.
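
Roughly, the relevant slurm.conf settings look like the
sketch below. The script paths, node names, and timings
are just placeholders, not a recommendation; the
suspend/resume programs are site-provided scripts that
Slurm calls with the list of nodes to power off or on.

  # Site-provided scripts Slurm runs to power nodes off/on
  # (paths are placeholders)
  SuspendProgram=/usr/local/sbin/node_poweroff
  ResumeProgram=/usr/local/sbin/node_poweron
  # Seconds a node must sit idle before it is powered down
  SuspendTime=1800
  # Seconds allowed for a node to finish powering down
  SuspendTimeout=120
  # Seconds allowed for a node to boot and rejoin the cluster
  ResumeTimeout=600
  # Nodes that should never be powered down (placeholder names)
  SuspendExcNodes=node[001-004]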

My understanding is that power cycling creates
temperature cycling, which then leads to premature node
failure. That makes sense, but has anyone ever studied
or tested this?

The only other reasons I can think of are that slow
server boot times delay job starts, or that there are
power surge concerns.

I'm curious about other ideas or experiences.

Thanks

--
Doug

