Hi all,
Another point you may notice is /dev/shm growing in size when MPI
jobs do not exit properly. Even though this storage is limited in size,
it still costs system memory, which matters on small configurations.
I use a cron job to clean it up periodically. I'm not the author of the
tool; see
https://docs.hpc.udel.edu/technical/whitepaper/automated_devshm_cleanup
for it.
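If you just want the idea without the full script, a minimal sketch (this is
not the linked tool; the one-day age threshold and the fuser check are my own
assumptions, meant to avoid deleting segments a live job still has open):

    # run from cron, e.g. hourly: delete /dev/shm files not modified in
    # the last 24 hours that no process currently has open
    find /dev/shm -type f -mmin +1440 ! -exec fuser -s {} \; -delete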
Patrick
On 12/05/2025 at 19:04, Stijn De Weirdt via slurm-users wrote:
hi all,
we are currently reviewing our limits after subtle OOM issues that had
nothing to do with jobs. we found out that idle (just rebooted) nodes
were not representative of nodes that had been running for a while:
gpfs mmfsd was using up to 2.5GB extra, rsyslogd was also growing
(0.7GB), and best of all, journald was storing its journal data in
tmpfs (and we had no limits set, but that was on us ;). vm.min_free_kbytes
probably also doesn't do what you think it does, so add it to the list
of overhead if it's large enough.
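for the journald part, a minimal sketch of the kind of cap we mean (the 256M
value and the choice of persistent storage are just illustrative, not our
actual settings):

    # /etc/systemd/journald.conf (restart systemd-journald after editing)
    [Journal]
    # either move the journal off tmpfs ...
    Storage=persistent
    # ... and/or cap what the tmpfs-backed /run/log/journal may use
    RuntimeMaxUse=256M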
anyway, i'd say there is no real way to predict this. we are
considering using e.g. 8GB as overhead and monitoring free minus
min_free_kbytes. if the minimum of that value stays well above 0 over a
long period of time, one could consider increasing the slurm value (i.e.
giving more back to the users).
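for the monitoring, something like this rough sketch works (assuming "free"
above means MemFree from /proc/meminfo; both values are in kB):

    # prints MemFree minus vm.min_free_kbytes, in kB
    awk -v reserved="$(cat /proc/sys/vm/min_free_kbytes)" \
        '/^MemFree:/ {print $2 - reserved}' /proc/meminfo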
(kudos to my colleague ivo for most of this)
stijn
On 5/12/25 18:21, Patrick Begou via slurm-users wrote:
Hi,
When deploying slurm and having some trouble starting slurmd on the
nodes, I found an interesting command to check the memory size seen
by slurm on a compute node:
sudo slurmd -C
This could be helpful. I then set the memory size of the node a
little bit lower to avoid running out of memory, specifically when a
user allocates the full node.
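For example (numbers purely illustrative), if slurmd -C reports something
like RealMemory=128440 for a node, the node definition in slurm.conf can be
set a little lower:

    # slurm.conf -- RealMemory set below what slurmd -C reports,
    # leaving headroom for the OS (values are only an example)
    NodeName=node01 CPUs=64 RealMemory=120000 State=UNKNOWN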
Patrick
On 12/05/2025 at 14:55, Xaver Stiensmeier via slurm-users wrote:
Josh,
thank you for your thorough answer. I, too, considered switching to
CR_Core_Memory after reading up on this. Thank you for confirming my
suspicion that without Memory as a consumable resource we cannot handle
high-memory requests adequately.
If I may ask: *How do you come up with the specific MemSpecLimit?*
Do you handpick a value for each node, use a constant value for all
nodes, or take a capped percentage of the maximum memory available?
Best regards, Xaver
On 5/12/25 14:43, Joshua Randall wrote:
Xaver,
It is my understanding that if we want stable systems that don't run
out of memory, then yes, we do need to manage the amount of memory
needed by everything not running within a slurm job.

In our cluster we use `CR_Core_Memory` (so we do constrain job memory)
and we set `RealMemory` to the full amount of memory actually available
on the machine - I believe these values really are given in megabytes
(MB), not mebibytes (MiB), which is how the slurm.conf documentation
describes them. We set `MemSpecLimit` on each node to set memory aside
for everything on the system that is not running within a slurm job:
the slurm daemon itself, the kernel, filesystem drivers, metrics
collection agents, etc. -- anything else we run outside the control of
slurm jobs. `MemSpecLimit` simply sets aside the specified amount, so
the maximum memory jobs can use on a node is (RealMemory - MemSpecLimit).
When cgroups are used to limit memory, slurmd itself is also confined to
that limit, so the daemon cannot encroach on job memory. Note, however,
that `MemSpecLimit` is documented not to work unless your
`SelectTypeParameters` includes Memory as a consumable resource.

Since you are using `CR_Core` (which does not make Memory a consumable
resource), I believe your system is not constraining job memory at all.
Jobs can oversubscribe memory as many times over as there are cores, and
any job can run the machine out of memory by using more than is
available. With that setting you could say you don't have to manage
reserving memory for the OS and slurmd, but only in the sense that any
job can consume all the memory and cause the system OOM killer to kill a
random process (including slurmd or something else system-critical).
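To make that concrete, here is a minimal sketch of the relevant pieces (the
node names and numbers are made up, and the cgroup lines assume you enforce
limits via task/cgroup):

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    TaskPlugin=task/cgroup
    # jobs on these nodes can use at most 257000 - 16384 MB in total
    NodeName=node[01-16] CPUs=64 RealMemory=257000 MemSpecLimit=16384

    # cgroup.conf -- actually enforce the memory limits
    ConstrainRAMSpace=yes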
Cheers,
Josh.
--
Dr. Joshua C. Randall
Director of Software Engineering, HPC
Altos Labs
email: jrand...@altoslabs.com
On Mon, May 12, 2025 at 10:27 AM Xaver Stiensmeier via slurm-users
<slurm-users@lists.schedmd.com> wrote:
Dear Slurm-User List,
currently, in our slurm.conf, we are setting:
SelectType: select/cons_tres
SelectTypeParameters: CR_Core
and in our node configuration RealMemory was basically reduced
by an amount to make sure the node always had enough RAM to run
the OS. However, this is apparently not how it is supposed to be
done:
Lowering RealMemory with the goal of setting aside some
amount for the OS and not available for job allocations will
not work as intended if Memory is not set as a consumable
resource in *SelectTypeParameters*. So one of the *_Memory
options need to be enabled for that goal to be accomplished.
(https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)
This leads to four questions regarding holding back RAM for
worker nodes. Answers/help with any of those questions would be
appreciated.
*1.* Is reserving enough RAM for the worker node's OS and slurmd
actually a thing you have to manage?
*2.* If so, how can we reserve enough RAM for the worker node's OS
and slurmd when using CR_Core?
*3.* Is that maybe a strong argument against using CR_Core that we
overlooked?
And semi-related:
https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory talks about
taking a value in megabytes.
*4.* Is RealMemory really expecting megabytes or is it mebibytes?
Best regards, Xaver
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com