Hey,
while the *node_reg_mem_percent* parameter sounds interesting, it
would only be feasible for us on a per-job basis (and I wasn't able to
find such an option at first glance). Many of our users need a certain
amount of RAM, and their jobs would fail if they get less. Therefore,
this doesn't solve our issue.
Best regards,
Xaver
On 8/14/25 10:42, Guillaume COCHARD via slurm-users wrote:
Hello,
You might want to use the *node_reg_mem_percent* parameter
(https://slurm.schedmd.com/slurm.conf.html#OPT_node_reg_mem_percent).
For example, if set to 80, it allows a node to register even if it
reports only 80% of the declared memory.
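In slurm.conf that would look like this (a minimal sketch; the value
80 is just an example):

    SlurmctldParameters=node_reg_mem_percent=80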
Guillaume
------------------------------------------------------------------------
*From:* "Xaver Stiensmeier via slurm-users" <[email protected]>
*To:* [email protected]
*Sent:* Thursday, 14 August 2025 10:01:26
*Subject:* [slurm-users] Nodes Become Invalid Due to Less Total RAM
Than Expected
Dear slurm-user list,
in the past we had a bigger buffer between RealMemory
<https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory> and the
instance memory. We then discovered that the right way is to activate
the *memory option* (SelectTypeParameters=CR_Core_Memory) and to set
MemSpecLimit
<https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit> to reserve
RAM for system processes.
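For reference, the relevant slurm.conf lines look roughly like this (a
sketch only; node name and values are illustrative, not our real
configuration):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    NodeName=node001 RealMemory=15991 MemSpecLimit=1024 State=CLOUD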
However, now we run into the problem that, due to *on-demand
scheduling*, we have to set up the slurm.conf in advance, using the
RAM values of our flavors as reported by our cloud provider
(OpenStack). These RAM values are higher than the RAM the machines
actually have later on:
ram_in_mib (OpenStack)    total_ram_in_mib (top/slurm)
  2048                      1968
 16384                     15991
 32768                     32093
 65536                     64297
122880                    120749
245760                    241608
491520                    483528
Given that we have to define the slurm.conf in advance, we essentially
have to predict how much total RAM the instances will have once
created. Of course I used linear regression to approximate the total
RAM and then lowered it a bit to have some cushion, but this feels
unsafe given that future flavors could deviate from the fit.
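For illustration, here is a minimal sketch of that prediction in
Python (3.10+ for statistics.linear_regression; the cushion value and
the 98304 MiB flavor are made up for the example):

    # Fit actual MemTotal (MiB) against the flavor size reported by
    # OpenStack, using the measurements from the table above, then
    # subtract a safety cushion before writing RealMemory into slurm.conf.
    from statistics import linear_regression

    flavor_mib = [2048, 16384, 32768, 65536, 122880, 245760, 491520]
    actual_mib = [1968, 15991, 32093, 64297, 120749, 241608, 483528]

    slope, intercept = linear_regression(flavor_mib, actual_mib)

    def predict_real_memory(flavor, cushion_mib=256):
        """Predicted usable RAM (MiB) for a flavor, minus a cushion."""
        return int(slope * flavor + intercept) - cushion_mib

    print(predict_real_memory(98304))  # hypothetical 96 GiB flavor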
From the kernel documentation
<https://www.kernel.org/doc/Documentation/filesystems/proc.txt> I know
that MemTotal is

    MemTotal: Total usable ram (i.e. physical ram minus a few reserved
              bits and the kernel binary code)

but given that the exact reservations are quite complex
<https://witekio.com/blog/cat-proc-meminfo-memtotal/>, I am wondering
whether I am doing something wrong, as this issue doesn't feel niche
enough to be this complicated.
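For anyone comparing on a live node: the value slurmd registers with
can be checked directly, and it is derived from that same MemTotal
(output below is illustrative and matches the 16384 MiB flavor row
above; 16374784 kB / 1024 = 15991 MiB):

    $ grep MemTotal /proc/meminfo
    MemTotal:       16374784 kB
    $ slurmd -C | grep -o 'RealMemory=[0-9]*'
    RealMemory=15991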
---
Anyway, setting the RAM value in the slurm.conf above the actual total
RAM by predicting too much leads to errors and to nodes being marked
as invalid:
[2025-08-11T08:19:04.736] debug: Node NODE_NAME has low
real_memory size (241607 / 245760) < 100.00%
[2025-08-11T08:19:04.736] error: _slurm_rpc_node_registration
node=NODE_NAME: Invalid argument
or
[2025-07-03T12:57:18.486] error: Setting node NODE_NAME state to
INVAL with reason:Low RealMemory (reported:64295 < 100.00% of
configured:68719)
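(For context, bringing such an INVAL node back after lowering
RealMemory in the slurm.conf typically means something along these
lines; a sketch, with NODE_NAME as in the log:

    $ scontrol reconfigure
    $ scontrol update NodeName=NODE_NAME State=RESUME
)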
Any hint on how to solve this is much appreciated!
Best regards,
Xaver
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]