Hi all,
Another point you may notice is /dev/shm growing in size when MPI
jobs do not exit properly. Even though this storage is limited in size,
it still costs system memory, which matters on small configurations.
I use a cron job to clean it up periodically. I'm not the author of the
tool; see
https://docs.hpc.udel.edu/technical/whitepaper/automated_devshm_cleanup
for it.
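If you just want the idea without the full script, a minimal sketch (this is
not the linked tool; the one-day age threshold and the fuser check are my own
assumptions, meant to avoid deleting segments a live job still has open):

    # run from cron, e.g. hourly: delete /dev/shm files not modified in
    # the last 24 hours that no process currently has open
    find /dev/shm -type f -mmin +1440 ! -exec fuser -s {} \; -delete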
Patrick
On 12/05/2025 at 19:04, Stijn De Weirdt via slurm-users wrote:
hi all,
we are currently reviewing our limits after subtle OOM issues that had
nothing to do with jobs. we found out that idle (just rebooted) nodes
were not representative of nodes that had been running for a while:
gpfs mmfsd was using up to 2.5GB extra, rsyslogd was also growing
(0.7GB), and best of all, journald was storing its journal data in
tmpfs (and we had no limits set, but that was on us ;). vm.min_free_kbytes
probably also doesn't do what you think it does, so add it to the list
of overhead if it's large enough.
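for the journald part, a minimal sketch of the kind of cap we mean (the 256M
value and the choice of persistent storage are just illustrative, not our
actual settings):

    # /etc/systemd/journald.conf (restart systemd-journald after editing)
    [Journal]
    # either move the journal off tmpfs ...
    Storage=persistent
    # ... and/or cap what the tmpfs-backed /run/log/journal may use
    RuntimeMaxUse=256M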
anyway, i'd say there is no real way to predict this. we are
considering using e.g. 8GB as overhead and monitoring free minus
min_free_kbytes. if the minimum of that value stays well above 0 over a
long period of time, one could consider increasing the slurm value (i.e.
giving more back to the users).
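for the monitoring, something like this rough sketch works (assuming "free"
above means MemFree from /proc/meminfo; both values are in kB):

    # prints MemFree minus vm.min_free_kbytes, in kB
    awk -v reserved="$(cat /proc/sys/vm/min_free_kbytes)" \
        '/^MemFree:/ {print $2 - reserved}' /proc/meminfo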
(kudos to my colleague ivo for most of this)
stijn
On 5/12/25 18:21, Patrick Begou via slurm-users wrote:
Hi,
When deploying slurm and having some trouble starting slurmd on the
nodes, I found an interesting command to check the memory size seen
by slurm on a compute node:
sudo slurmd -C
This could be helpful. I then set the memory size of the node a
little bit lower to avoid running out of memory, specifically when a
user allocates the full node.
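For example (numbers purely illustrative), if slurmd -C reports something
like RealMemory=128440 for a node, the node definition in slurm.conf can be
set a little lower:

    # slurm.conf -- RealMemory set below what slurmd -C reports,
    # leaving headroom for the OS (values are only an example)
    NodeName=node01 CPUs=64 RealMemory=120000 State=UNKNOWN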
Patrick
On 12/05/2025 at 14:55, Xaver Stiensmeier via slurm-users wrote:
Josh,
thank you for your thorough answer. I, too, considered switching to
CR_Core_Memory after reading up on this. Thank you for confirming my
suspicion that without Memory as a consumable resource we cannot handle
high-memory requests adequately.
If I may ask: *How do you come up with the specific MemSpecLimit?*
Do you handpick a value for each node, use a constant value for all
nodes, or take a capped percentage of the maximum memory available?
Best regards, Xaver
On 5/12/25 14:43, Joshua Randall wrote:
Xaver,
It is my understanding that if we want stable systems that don't run
out of memory, then yes, we do need to manage the amount of memory
needed by everything not running within a slurm job.

In our cluster we use `CR_Core_Memory` (so we do constrain job memory)
and we set `RealMemory` to the full amount of memory actually available
on the machine - I believe these values really are given in megabytes
(MB), not mebibytes (MiB), which is how the slurm.conf documentation
describes them. We set `MemSpecLimit` on each node to set memory aside
for everything on the system that is not running within a slurm job:
the slurm daemon itself, the kernel, filesystem drivers, metrics
collection agents, etc. -- anything else we run outside the control of
slurm jobs. `MemSpecLimit` simply sets aside the specified amount, so
the maximum memory jobs can use on a node is (RealMemory - MemSpecLimit).
When cgroups are used to limit memory, slurmd itself is also confined to
that limit, so the daemon cannot encroach on job memory. Note, however,
that `MemSpecLimit` is documented not to work unless your
`SelectTypeParameters` includes Memory as a consumable resource.

Since you are using `CR_Core` (which does not make Memory a consumable
resource), I believe your system is not constraining job memory at all.
Jobs can oversubscribe memory as many times over as there are cores, and
any job can run the machine out of memory by using more than is
available. With that setting you could say you don't have to manage
reserving memory for the OS and slurmd, but only in the sense that any
job can consume all the memory and cause the system OOM killer to kill a
random process (including slurmd or something else system-critical).
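To make that concrete, here is a minimal sketch of the relevant pieces (the
node names and numbers are made up, and the cgroup lines assume you enforce
limits via task/cgroup):

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    TaskPlugin=task/cgroup
    # jobs on these nodes can use at most 257000 - 16384 MB in total
    NodeName=node[01-16] CPUs=64 RealMemory=257000 MemSpecLimit=16384

    # cgroup.conf -- actually enforce the memory limits
    ConstrainRAMSpace=yes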
Cheers,
Josh.
--
Dr. Joshua C. Randall
Director of Software Engineering, HPC
Altos Labs
email: jrand...@altoslabs.com
On Mon, May 12, 2025 at 10:27 AM Xaver Stiensmeier via slurm-users
<slurm-users@lists.schedmd.com> wrote:
Dear Slurm-User List,
currently, in our slurm.conf, we are setting:
SelectType: select/cons_tres
SelectTypeParameters: CR_Core
and in our node configuration RealMemory was basically reduced
by an amount to make sure the node always had enough RAM to run
the OS. However, this is apparently not how it is supposed to be
done:
Lowering RealMemory with the goal of setting aside some
amount for the OS and not available for job allocations will
not work as intended if Memory is not set as a consumable
resource in *SelectTypeParameters*. So one of the *_Memory
options need to be enabled for that goal to be accomplished.
(https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)
This leads to four questions regarding holding back RAM for
worker nodes. Answers/help with any of those questions would be
appreciated.
*1.* Is reserving enough RAM for the worker node's OS and slurmd
actually a thing you have to manage?
*2.* If so, how can we reserve enough RAM for the worker node's OS
and slurmd when using CR_Core?
*3.* Is that maybe a strong argument against using CR_Core that we
overlooked?
And semi-related:
https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory talks about
taking a value in megabytes.
*4.* Is RealMemory really expecting megabytes or is it mebibytes?
Best regards, Xaver
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com