Exactly. The easiest way is simply to underreport the amount of memory to
Slurm; that way Slurm will enforce the limit natively. We do this here as
well, even though our nodes have disks, to make sure the OS has memory
left to run.
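For example (names and numbers purely illustrative), a node with roughly
192000 Mbytes of physical RAM might be declared with something like:

    NodeName=node[001-016] CPUs=40 RealMemory=172000   # physical is ~192000 MB

Jobs then can never be allocated more than 172000 Mbytes on a node, and the
remainder stays free for the OS.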
-Paul Edmon-
On 3/14/19 8:36 AM, Doug Meyer wrote:
We also run diskless. In slurm.conf we round the memory down so that
Slurm does not have the node's full physical memory to budget with, and
we use a default memory-per-CPU value equal to the declared memory
divided by the number of threads per node. If users don't declare a
memory limit we are fine; if they declare more we are mostly fine too.
We had to turn off memory enforcement because job memory usage is very
uneven during runtime, but with the above we have seldom had problems.
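In slurm.conf terms that works out to something like (values illustrative):

    NodeName=node[001-016] CPUs=40 RealMemory=172000   # rounded down from physical
    DefMemPerCPU=4300                                  # 172000 MB / 40 threads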
Doug
On Thu, Mar 14, 2019 at 3:57 AM david baker <djbake...@gmail.com> wrote:
Hello Paul,
Thank you for your advice; that all makes sense. We're running
diskless compute nodes, so the usable memory is less than the total
physical memory. I have therefore added a memory check to my
job_submit.lua -- see below.
Best regards,
David
-- Check that the requested memory per node is valid.
-- 9223372036854775808 is the sentinel value we see when the job did not
-- request any memory, so apply the partition default of 4300 MB per CPU.
if job_desc.min_mem_per_cpu == 9223372036854775808 then
   job_desc.min_mem_per_cpu = 4300
end
local memory = job_desc.min_mem_per_cpu * job_desc.min_cpus
if memory > 172000 then
   slurm.log_user("You cannot request more than 172000 Mbytes per node")
   slurm.log_user(string.format("memory is: %d Mbytes", memory))
   return slurm.ERROR
end
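For completeness, the check above sits inside the standard entry point in
job_submit.lua, roughly like this:

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- memory check from above, plus our routing logic, goes here
       return slurm.SUCCESS
    end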
On Tue, Mar 12, 2019 at 4:48 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:
Slurm should automatically reject jobs that ask for more memory than a
single node in that partition provides, so you shouldn't need to do
anything for that case. If you want a limit lower than the maximum
memory per node, then you will need to enforce it yourself. We do this
via a job_submit lua script, and that would be my recommended method.
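We enable it with

    JobSubmitPlugins=lua

in slurm.conf and keep the checks in a job_submit.lua file in the same
directory as slurm.conf.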
-Paul Edmon-
On 3/12/19 12:31 PM, David Baker wrote:
Hello,
I have set up a serial queue to run small jobs in the
cluster. Actually, I route jobs to this queue using the
job_submit.lua script. Any one-node job using up to 20 cpus is
routed to this queue, unless the user submits their job with the
exclusive flag.
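The routing rule is along these lines (simplified sketch of the relevant
part of job_submit.lua; exact field names and unset-value handling depend
on the Slurm version):

    if job_desc.max_nodes == 1 and job_desc.min_cpus <= 20
          and job_desc.shared ~= 0 then   -- shared == 0 means --exclusive
       job_desc.partition = "serial"
    end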
The partition is shared and so I defined memory to be a
resource. I've set default memory/cpu to be 4300 Mbytes.
There are 40 cpus installed in the nodes and the usable
memory is circa 172000 Mbytes -- hence my default mem/cpu
(172000 / 40 = 4300). The compute nodes are defined with
RealMemory=190000, by the way.
I am curious to understand how I can impose a memory limit on
the jobs that are submitted to this partition. It doesn't make
any sense to request more than the total usable memory on a
node. Could anyone please advise me how to ensure that users
cannot request more than the usable memory on the nodes?
Best regards,
David
PartitionName=serial nodes=red[460-464] Shared=Yes
MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00
QOS=serial SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4300 State=UP AllowGroups=jfAccessToIridis5
PriorityJobFactor=10 PreemptMode=off