Doug,
I don't think thermal cycling is as big of an issue as it used to be.
From what I've always been told/read, the biggest problem with thermal
cycling was "chip creep", where the expansion/contraction of a chip in a
socket would cause the chip to eventually work itself loose enough to
cause faulty connections. 20+ years ago, I remember looking at
motherboards with chips inserted into sockets. On a modern motherboard,
just about everything is soldered to the motherboard, except the CPU and
DIMMs. The CPUs are usually locked securely into place so chip creep
won't happen, and the DIMMs have a latching mechanism, although anyone
who has ever reseated a DIMM to fix a problem knows that mechanism
isn't perfect.
As someone else has pointed out, components with moving parts, like
spinning disks, are at higher risk of failure. Here, too, that risk is
disappearing, as SSDs are becoming more common, with even NVMe drives
available in servers.
I know there is a direct relationship between system failure and
operating temperature, but I don't know if that applies to all
components, or just those with moving parts. Someone somewhere must
have done research on this. I know Google did research on hard drive
failure that was pretty popular. I would imagine they would have
researched this, too.
As an example, when I managed an IBM Blue Gene/P, I remember IBM touting
that all the components on a node (which was only the size of a PCI
card) were soldered to the board - nothing was plugged into a socket.
This was to completely eliminate chip creep and increase reliability.
Also, the BG/P would shut down nodes between jobs, just as you're asking
about here. If there was another job waiting in the queue for those
nodes, the nodes would at least reboot between every job.
I do have to say that even though my BG/P was small for a Blue Gene, it
still had 2048 nodes, and given that number of nodes, I had extremely
few hardware problems at the node-level, so there's something to be said
for that logic. I did, however, have to occasionally reseat a node into
a node card, which is the same as reseating a DIMM or a PCI card in a
regular server.
Prentice
On 7/16/21 3:35 PM, Douglas Eadline wrote:
Hi everyone:
Reducing power use has become an important topic. One
of the questions I always wondered about is
why more clusters do not turn off unused nodes. Slurm
has hooks to turn nodes off when not in use and
turn them on when resources are needed.
My understanding is that power cycling creates
temperature cycling, which then leads to premature node
failure. That makes sense, but has anyone ever studied/tested
this?
The only other reasons I can think of are that the delay in
server boot time makes job starts slow, or possible power
surge issues.
I'm curious about other ideas or experiences.
Thanks
--
Doug
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf