[slurm-users] Re: Nodes Become Invalid Due to Less Total RAM Than Expected

Brian Andrus via slurm-users Mon, 18 Aug 2025 07:50:27 -0700

Guillaume,

Jobs shouldn't fail if they are requesting the max amount of memory they intend to use. If that is not there, the job would not start (perhaps that is what you meant).

If they 'need' 100% of the available memory, you will definitely have some issues, as the OS itself needs some of that memory. That is the idea behind the setting. It will also give you the buffer for the few bytes that can deviate when doing 'slurmd -C' to read the memory reported by the node.

I used to just truncate down to the nearest '00' (e.g. 1675 became 1600) before node_reg_mem_percent was shown to me. Now I just set that to 95%, which allows for any deviations that occur.
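
As a rough illustration of the two approaches above, here is a minimal Python sketch. The memory figures are the ones from the original post quoted further down; the helper names (floor_to_hundred, registers_ok) are hypothetical, and the comparison is only an assumption modeled on the "reported < 100.00% of configured" log message, not Slurm's actual code.

```python
# (declared by OpenStack, total seen by top/slurm) in MiB, from the original post.
SAMPLES = [
    (2048, 1968),
    (16384, 15991),
    (32768, 32093),
    (65536, 64297),
    (122880, 120749),
    (245760, 241608),
    (491520, 483528),
]


def floor_to_hundred(mib):
    """Truncate down to the nearest '00', e.g. 1675 -> 1600."""
    return (mib // 100) * 100


def registers_ok(reported, configured, percent=100.0):
    """Assumed registration check: the node passes if it reports at least
    `percent` percent of the configured memory (hypothetical helper)."""
    return reported >= configured * percent / 100.0


if __name__ == "__main__":
    for declared, reported in SAMPLES:
        print(
            declared,
            reported,
            floor_to_hundred(reported),
            registers_ok(reported, declared, 100.0),
            registers_ok(reported, declared, 95.0),
        )
```

Under that assumed check, none of the listed flavors reach 100% of the declared value, while all of them stay above 95%.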


Brian Andrus

On 8/18/2025 4:27 AM, Xaver Stiensmeier via slurm-users wrote:

Hey,
while the *node_reg_mem_percent* parameter sounds interesting, it would only be feasible for us on a per job basis (I wasn't able to find it there at first glance). Many of our users need a certain amount of RAM and jobs would fail if they have less. Therefore, this doesn't solve our issue.
Best regards,
Xaver

On 8/14/25 10:42, Guillaume COCHARD via slurm-users wrote:
Hello,
You might want to use the *node_reg_mem_percent* parameter (https://slurm.schedmd.com/slurm.conf.html#OPT_node_reg_mem_percent). For example, if set to 80, it will allow a node to work even if it has only 80% of the declared memory.
Guillaume

------------------------------------------------------------------------
*From: *"Xaver Stiensmeier via slurm-users" <[email protected]>
*To: *[email protected]
*Sent: *Thursday, 14 August 2025 10:01:26
*Subject: *[slurm-users] Nodes Become Invalid Due to Less Total RAM Than Expected
Dear slurm-user list,
in the past we had a bigger buffer between RealMemory <https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory> and the instance memory. We then discovered that the right way is to activate the *memory option* (SelectTypeParameters=CR_Core_Memory) and set MemSpecLimit <https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit> to secure RAM for system processes.
However, now we run into the problem that due to *on demand scheduling*, we have to set up the slurm.conf in advance by using the RAM values from our flavors as reported by our cloud provider (OpenStack). These RAM values are higher than the RAM values the machines actually have later on:
ram_in_mib by openstack    total_ram_in_mib by top/slurm
2048                       1968
16384                      15991
32768                      32093
65536                      64297
122880                     120749
245760                     241608
491520                     483528
Given that we have to define the slurm.conf in advance, we kinda have to predict how much total RAM the instances have once created. Of course I used linear regression to approximate the total RAM and then lowered it a bit to have some cushion, but this feels unsafe given that future flavors could differ from that.
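
The regression-plus-cushion approach described above could look roughly like the following Python sketch. It fits an ordinary least-squares line to the values in the table; the 2% cushion and the function names are assumptions for illustration, not the values or code actually used.

```python
import math

# (ram_in_mib by openstack, total_ram_in_mib by top/slurm), from the table above.
SAMPLES = [
    (2048, 1968),
    (16384, 15991),
    (32768, 32093),
    (65536, 64297),
    (122880, 120749),
    (245760, 241608),
    (491520, 483528),
]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_real_memory(flavor_mib, slope, intercept, cushion=0.02):
    """Predicted total RAM, lowered by an assumed relative cushion and
    rounded down to whole MiB."""
    return math.floor((slope * flavor_mib + intercept) * (1.0 - cushion))


slope, intercept = fit_line(SAMPLES)

if __name__ == "__main__":
    for declared, reported in SAMPLES:
        print(declared, reported, predict_real_memory(declared, slope, intercept))
```

Because least-squares residuals sum to zero, the raw fit necessarily overshoots some of the measured values, which is why a cushion is needed at all; whether a given cushion still holds for future flavors is exactly the open question.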
From the kernel documentation <https://www.kernel.org/doc/Documentation/filesystems/proc.txt> I know that MemTotal is
    MemTotal: Total usable ram (i.e. physical ram minus a few
    reserved bits and the kernel binary code)
but given that the concrete reserved bits are quite complex <https://witekio.com/blog/cat-proc-meminfo-memtotal/>, I am wondering whether I am doing something wrong as this issue doesn't feel niche enough to be that complicated.
---
Anyway, setting the RAM value in the slurm.conf above the actual total RAM (by predicting too much) leads to errors and nodes being marked as invalid:
    [2025-08-11T08:19:04.736] debug:  Node NODE_NAME has low
    real_memory size (241607 / 245760) < 100.00%
    [2025-08-11T08:19:04.736] error: _slurm_rpc_node_registration
    node=NODE_NAME: Invalid argument

or

    [2025-07-03T12:57:18.486] error: Setting node NODE_NAME state to
    INVAL with reason:Low RealMemory (reported:64295 < 100.00% of
    configured:68719)

Any hint on how to solve this is much appreciated!

Best regards,
Xaver

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
