We do something very similar at HMS. For instance, on our nodes with 257468 MB of RAM we round RealMemory down to 257000, and on nodes with 1031057 MB of RAM we round it down to 1000000, etc.
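
In slurm.conf terms that amounts to node definitions roughly like the following sketch (the NodeName patterns and CPU counts are placeholders; the RealMemory values are the rounded-down figures above):

    # Sketch only: names and CPU counts are made up.
    NodeName=node-a[01-20]  CPUs=64  RealMemory=257000   # nodes with 257468 MB physical
    NodeName=node-b[01-04]  CPUs=96  RealMemory=1000000  # nodes with 1031057 MB physical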

We may tune this on our next OS and Slurm update, as I expect to see more memory 
used by the OS once we migrate to RHEL 9.

Cheers

--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--
________________________________
From: Paul Edmon via slurm-users <slurm-users@lists.schedmd.com>
Sent: Monday, May 12, 2025 10:14 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: Do I have to hold back RAM for worker nodes?


The way we typically do it here is to look at the idle memory usage of the OS on a node and then reserve the nearest power of two for that. For instance, right now we have 16 GB set for our MemSpecLimit. That may seem like a lot, but our nodes typically have 1 TB of memory, so 16 GB is not that much. Newer hardware tends to eat up more base memory, at least in my experience.
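
A rough slurm.conf sketch of that approach (the node name, CPU count, and RealMemory figure are illustrative; 16384 assumes the power-of-two reading of "16 GB", since MemSpecLimit is specified in MB):

    # Sketch: reserve 16 GB of a ~1 TB node for the OS, slurmd, etc.
    # MemSpecLimit only takes effect when Memory is a consumable resource
    # (i.e. one of the CR_*_Memory SelectTypeParameters).
    NodeName=bignode[001-100]  CPUs=64  RealMemory=1024000  MemSpecLimit=16384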

-Paul Edmon-

On 5/12/25 8:55 AM, Xaver Stiensmeier via slurm-users wrote:

Josh,

thank you for your thorough answer. I, too, considered switching to CR_Core_Memory after reading up on this. Thank you for confirming my suspicion that without Memory as a consumable resource we cannot handle high-memory requests adequately.

If I may ask: how do you come up with the specific MemSpecLimit? Do you handpick a value for each node, have you picked a constant value for all nodes, or do you take a capped percentage of the maximum memory available?

Best regards,
Xaver

On 5/12/25 14:43, Joshua Randall wrote:
Xaver,

It is my understanding that if we want to have stable systems that don't run 
out of memory, we do need to manage the amount of memory needed for everything 
not running within a slurm job, yes.

In our cluster, we are using `CR_Core_Memory` (so we do constrain job memory) and we set `RealMemory` to the actual full amount of memory available on the machine. I believe these values really are given in megabytes (MB), not mebibytes (MiB); I think their example value ("2048") is intended to convey this, because 2000 MiB is 2048 MB. We set `MemSpecLimit` on each node to set memory aside for everything on the system that is not running within a Slurm job. This includes the Slurm daemon itself, the kernel, filesystem drivers, metrics collection agents, etc. -- anything we run outside the control of Slurm jobs. `MemSpecLimit` just sets aside the specified amount, with the result that the maximum memory jobs can use on the node is (RealMemory - MemSpecLimit). When using cgroups to limit memory, slurmd itself will also run within the specified limit, so the Slurm daemon cannot encroach on job memory. However, note that `MemSpecLimit` is documented not to work unless your `SelectTypeParameters` includes Memory as a consumable resource.
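
Put together, a minimal slurm.conf sketch of that kind of setup might look like this (the node name, CPU count, and memory figures are invented for illustration; the point is that RealMemory holds the full physical amount and MemSpecLimit carves out the non-job overhead):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory     # Memory must be a consumable resource
                                            # for MemSpecLimit to have any effect

    # Full physical memory, minus 32 GB reserved for the OS, slurmd, agents, etc.
    # Jobs on this node can then use at most 1024000 - 32768 = 991232 MB.
    NodeName=node[001-020]  CPUs=128  RealMemory=1024000  MemSpecLimit=32768

If the limits are actually enforced with cgroups, that typically also implies TaskPlugin=task/cgroup in slurm.conf and ConstrainRAMSpace=yes in cgroup.conf, but those pieces are outside this sketch.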

Since you are using `CR_Core` (which does not make Memory a consumable resource), I believe your system is not constraining job memory at all. Jobs can oversubscribe memory as many times over as there are cores, and any single job can run the machine out of memory by using more than is available. With this setting, I suppose you could say you don't have to manage reserving memory for the OS and slurmd, but only in the sense that any job could consume all the memory and cause the system OOM killer to kill a random process (including slurmd or something else system-critical).
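
For comparison, the change that would make Slurm track and enforce per-job memory is essentially a one-line switch of SelectTypeParameters, optionally combined with a default for jobs that do not request memory explicitly. A sketch only; the DefMemPerCPU value is an arbitrary example:

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory   # instead of CR_Core
    DefMemPerCPU=4000                     # illustrative default, in MB per allocated CPU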

Cheers,

Josh.


--
Dr. Joshua C. Randall
Director of Software Engineering, HPC
Altos Labs
email: jrand...@altoslabs.com



On Mon, May 12, 2025 at 10:27 AM Xaver Stiensmeier via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

Dear Slurm-User List,

currently, in our slurm.conf, we are setting:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core

and in our node configuration RealMemory was reduced by some amount to make sure 
the node always had enough RAM left over to run the OS. However, this is 
apparently not how it is supposed to be done:

Lowering RealMemory with the goal of setting aside some amount for the OS and 
not available for job allocations will not work as intended if Memory is not 
set as a consumable resource in SelectTypeParameters. So one of the *_Memory 
options need to be enabled for that goal to be accomplished. 
(https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)

This leads to four questions regarding holding back RAM for worker nodes. 
Answers/help with any of those questions would be appreciated.

1. Is reserving enough RAM for the worker node's OS and slurmd actually a thing 
you have to manage?
2. If so, how can we reserve enough RAM for the worker node's OS and slurmd when 
using CR_Core?
3. Is that maybe a strong argument against using CR_Core that we overlooked?

And semi-related: https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory talks 
about taking a value in megabytes.

4. Is RealMemory really expecting megabytes or is it mebibytes?

Best regards,
Xaver


Altos Labs UK Limited | England | Company reg 13484917
Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire, United 
Kingdom, WA14 2DT





-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
