On 7/28/22 18:49, Djamil Lakhdar-Hamina wrote:
I am helping set up a 16-node compute cluster. I am not a system
administrator, but I work for a small firm and have to pick up the
needed skills quickly in areas where I have little experience. I am
running Rocky Linux 8 on Intel Xeon Phi Knights Landing nodes donated by
the TAAC center. We are operating in Uganda, where we have limited
resources and where power is quite expensive.
What are some good ways to implement power saving? I have already tried
power saving as described in Slurm's power saving guide, but 1) I am not
quite sure what it does, and 2) when I implemented a version in my
virtual dev environment, I was able to get the power saving to stand
nodes down, but I was not able to get it to spin them back up when
needed. I enabled power saving in the slurm.conf file, and I also
specified a SuspendProgram and a ResumeProgram similar to the ones in
the guide at https://slurm.schedmd.com/power_save.html.
You might also look at Variorum:
https://variorum.readthedocs.io/en/latest/api/cap_functions.html
https://github.com/LLNL/variorum
So 1) how do I get this power saving mechanism to work, and what exactly
does it do? I see that it stands nodes down, but will it spin them back
up when those resources are requested? 2) Are there any better
techniques for power saving, say using ipmitool or something similar?
Sincerely,
Djamil Lakhdar-Hamina
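
On the question of whether nodes come back up: with power saving
enabled, slurmctld runs ResumeProgram with a hostlist argument when a
job is allocated to powered-down nodes, and it expects slurmd on each
node to register within ResumeTimeout. If ResumeProgram does not really
power the node on, the node never returns. Below is a minimal sketch of
one script that could serve as both SuspendProgram and ResumeProgram by
calling ipmitool. It is only an illustration, not the script from the
Slurm guide. The "-bmc" hostname suffix, the "admin" user, the password
file path, and the symlink names slurm_resume and slurm_suspend are all
assumptions to adapt to your site.

```python
#!/usr/bin/env python3
"""Hypothetical SuspendProgram/ResumeProgram helper for Slurm power saving."""
import os
import subprocess
import sys

# Assumptions (not from the original post): each node's BMC answers at
# "<nodename>-bmc", and the BMC password is kept in a root-only file.
BMC_SUFFIX = "-bmc"
BMC_USER = "admin"
BMC_PASSWORD_FILE = "/etc/slurm/bmc_password"


def action_from_program_name(path):
    """Pick the IPMI power action from the name the script was invoked as."""
    name = os.path.basename(path)
    if "resume" in name:
        return "on"
    if "suspend" in name:
        return "soft"  # ACPI soft shutdown rather than a hard power cut
    raise ValueError(f"cannot tell suspend from resume in name: {name}")


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist such as 'node[01-03]' into single names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def build_ipmi_command(node, action):
    """Return the ipmitool argument list that applies `action` to one node."""
    if action not in ("on", "soft"):
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", node + BMC_SUFFIX,
        "-U", BMC_USER,
        "-f", BMC_PASSWORD_FILE,
        "chassis", "power", action,
    ]


def main(argv):
    """Apply the power action to every node in the hostlist Slurm passed."""
    action = action_from_program_name(argv[0])
    failures = 0
    for node in expand_hostlist(argv[1]):
        if subprocess.run(build_ipmi_command(node, action)).returncode != 0:
            failures += 1
    return 1 if failures else 0


# Slurm invokes the program with the hostlist as its only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

To try it, install the file once, create two symlinks to it named
slurm_resume and slurm_suspend, and point ResumeProgram and
SuspendProgram in slurm.conf at those symlinks. Before involving Slurm,
run the resume symlink by hand with a single node name to confirm that
the node really powers on and that slurmd starts at boot.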