Thank you very much for the many experiences shared - especially for
pointing out how RAM requirements may grow over time!
Our instances can vary wildly from 2 GB (rather unreasonable for Slurm)
to multiple TB of RAM, and since we only provide resources and tooling
but do not manage the running clusters, we cannot readjust values once
a cluster has started.
Currently, I am considering using CR_Core_Memory with the following
node configuration (all values in MB):
RealMemory = node_memory
MemSpecLimit = min(node_memory // 4 + 1000, 8000)
This would result in (node RAM -> reserved):
2 GB  -> 1 GB (which is unreasonably small anyway)
4 GB  -> 2 GB
8 GB  -> 3 GB
16 GB -> 5 GB
32 GB -> 8 GB
...   -> 8 GB (cap)
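For illustration, a small Python sketch of that formula (assuming
node_memory is the node's RAM in MB):

    def memspec_limit_mb(node_memory_mb: int) -> int:
        # Reserve a quarter of the node's RAM plus 1000 MB, capped at 8000 MB.
        return min(node_memory_mb // 4 + 1000, 8000)

    for gb in (2, 4, 8, 16, 32, 64):
        print(gb, "GB ->", memspec_limit_mb(gb * 1024), "MB reserved")
    # -> 1512, 2024, 3048, 5096, 8000, 8000 MB, i.e. roughly the rounded
    #    GB figures listed above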
This tries to respect the fact that smaller instances simply cannot
give much RAM to the system. I know that, especially on the 2 GB RAM
instances, this will probably still lead to OOM terminations, but if I
reserve the whole 2 GB for Slurm and the OS, there is not much left to
compute with. We will add a warning that instances with less than 4 GB
RAM are not really feasible as worker nodes. I feel we will only be
able to improve this formula with more experience.
Best regards
Xaver
On 5/12/25 21:30, Timony, Mick via slurm-users wrote:
We do something very similar at HMS. For instance, on nodes with
257468 MB of RAM we round RealMemory down to 257000 MB, and for nodes
with 1031057 MB of RAM we round down to 1000000 MB, etc.
We may tune this on our next OS and Slurm update, as I expect to see
more memory used by the OS once we migrate to RHEL9.
Cheers
--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--
------------------------------------------------------------------------
*From:* Paul Edmon via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Monday, May 12, 2025 10:14 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Do I have to hold back RAM for worker nodes?
The way we typically do it here is to look at the idle memory usage of
the OS on the node and then reserve the nearest power of 2 for that.
For instance, right now we have 16 GB set for our MemSpecLimit. That
may seem like a lot, but our nodes typically have 1 TB of memory, so
16 GB is not that much. The newer hardware tends to eat up more base
memory, at least in my experience.
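As a rough illustration of that idea (one possible interpretation:
round the observed idle/base memory usage up to the next power of two;
the 12 GB input below is only a made-up example):

    import math

    def memspec_from_idle_mb(idle_mb: int) -> int:
        # Round the OS's idle/base memory usage (MB) up to the next power of two.
        return 2 ** math.ceil(math.log2(idle_mb))

    print(memspec_from_idle_mb(12 * 1024))  # 16384 MB, i.e. 16 GB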
-Paul Edmon-
On 5/12/25 8:55 AM, Xaver Stiensmeier via slurm-users wrote:
Josh,
thank you for your thorough answer. I, too, considered switching to
CR_Core_Memory after reading into this. Thank you for confirming my
suspicion that without Memory, we cannot handle high memory requests
adequately.
If I may ask: *How do you come up with the specific MemSpecLimit?* Do
you handpick a value for each node, have you picked a constant value
for all nodes or do you take a capped percentage of the maximum
memory available?
Best regards,
Xaver
On 5/12/25 14:43, Joshua Randall wrote:
Xaver,
It is my understanding that if we want to have stable systems that
don't run out of memory, we do need to manage the amount of memory
needed for everything not running within a slurm job, yes.
In our cluster, we are using `CR_Core_Memory` (so we do constrain
our job memory) and we set the `RealMemory` to the actual full
amount of memory available on the machine - I believe these really
are given in megabytes (MB), not mebibytes (MiB). I think their
example (e.g. "2048") is intended to convey this because 2000 MiB
is 2048 MB. We set the `MemSpecLimit` for each node to set memory
aside for everything in the system that is not running within a
slurm job. This includes the slurm daemon itself, the kernel,
filesystem drivers, metrics collection agents, etc -- anything else
we are running outside the control of slurm jobs. The `MemSpecLimit`
just sets aside the specified amount and the result will be that the
maximum memory jobs can use on the node is (RealMemory -
MemSpecLimit). When using cgroups to limit memory, slurmd will also
be allocated the specified limit so that the slurm daemon cannot
encroach on job memory. However, note that `MemSpecLimit` is
documented to not work unless your `SelectTypeParameters` includes
Memory as a consumable resource.
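A minimal slurm.conf-style sketch of such a setup (the node name, core
count, and memory values here are made-up examples, not a
recommendation):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    # hypothetical 128 GB node: 4000 MB set aside for the OS, slurmd and
    # other system daemons, so jobs can allocate at most
    # RealMemory - MemSpecLimit = 124000 MB
    NodeName=worker[001-010] CPUs=32 RealMemory=128000 MemSpecLimit=4000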
Since you are using `CR_Core` (which does not configure Memory as a
consumable resource), I believe your system will not be
constraining job memory at all. Jobs can oversubscribe memory as
many times over as there are cores, and any job would be able to run
the machine out of memory by using more than is available. With this
setting, I guess you could say you don't have to manage reserving
memory for the OS and slurmd, but only in the sense that any job
could consume all the memory and cause the system OOM killer to kill
a random process (including slurmd or something else system critical).
Cheers,
Josh.
--
Dr. Joshua C. Randall
Director of Software Engineering, HPC
Altos Labs
email: jrand...@altoslabs.com
On Mon, May 12, 2025 at 10:27 AM Xaver Stiensmeier via slurm-users
<slurm-users@lists.schedmd.com> wrote:
Dear Slurm-User List,
currently, in our slurm.conf, we are setting:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
and in our node configuration, /RealMemory/ was basically reduced
by some amount to make sure the node always had enough RAM left to run
the OS. However, this is apparently not how it is supposed to be
done:
Lowering RealMemory with the goal of setting aside some
amount for the OS and not available for job allocations will
not work as intended if Memory is not set as a consumable
resource in *SelectTypeParameters*. So one of the *_Memory
options need to be enabled for that goal to be accomplished.
(https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)
This leads to four questions regarding holding back RAM for
worker nodes. Answers/help with any of those questions would be
appreciated.
*1.* Is reserving enough RAM for the worker node's OS and
slurmd actually a thing you have to manage?
*2.* If so how can we reserve enough RAM for the worker
node's OS and slurmd when using CR_Core?
*3.* Is that maybe a strong argument against using CR_Core
that we overlooked?
And semi-related:
https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory
talks about taking a value in megabytes.
*4.* Is RealMemory really expecting megabytes or is it
mebibytes?
Best regards,
Xaver
Altos Labs UK Limited | England | Company reg 13484917
Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire,
United Kingdom, WA14 2DT
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com