Hello Paul,

Thank you for your advice -- that all makes sense. We're running diskless compute nodes, so the usable memory is less than the total memory. I have therefore added a memory check to my job_submit.lua -- see below; I think that covers it.
Best regards,
David

-- Check memory/node is valid
if job_desc.min_mem_per_cpu == 9223372036854775808 then
   -- --mem-per-cpu was not set by the user; apply the partition default (DefMemPerCPU=4300)
   job_desc.min_mem_per_cpu = 4300
end

local memory = job_desc.min_mem_per_cpu * job_desc.min_cpus

if memory > 172000 then
   slurm.log_user("You cannot request more than 172000 Mbytes per node")
   slurm.log_user("memory is: %u", memory)
   return slurm.ERROR
end

On Tue, Mar 12, 2019 at 4:48 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:

> Slurm should automatically block or reject jobs that can't run on that
> partition in terms of memory usage for a single node, so you shouldn't
> need to do anything. If you need something less than the max memory per
> node, then you will need to enforce some limits. We do this via a
> job_submit lua script. That would be my recommended method.
>
> -Paul Edmon-
>
> On 3/12/19 12:31 PM, David Baker wrote:
>
> Hello,
>
> I have set up a serial queue to run small jobs in the cluster. Actually, I
> route jobs to this queue using the job_submit.lua script. Any one-node job
> using up to 20 cpus is routed to this queue, unless a user submits
> their job with an exclusive flag.
>
> The partition is shared, and so I defined memory to be a consumable
> resource. I've set the default memory/cpu to be 4300 Mbytes. There are 40
> cpus installed in the nodes and the usable memory is circa 172000 Mbytes --
> hence my default mem/cpu.
>
> The compute nodes are defined with RealMemory=190000, by the way.
>
> I am curious to understand how I can impose a memory limit on the jobs
> that are submitted to this partition. It doesn't make any sense to request
> more than the total usable memory on the nodes. So could anyone please
> advise me how to ensure that users cannot request more than the usable
> memory on the nodes?
>
> Best regards,
> David
>
> PartitionName=serial nodes=red[460-464] Shared=Yes MaxCPUsPerNode=40
> DefaultTime=02:00:00 MaxTime=60:00:00 QOS=serial
> SelectTypeParameters=CR_Core_Memory DefMemPerCPU=4300 State=UP
> AllowGroups=jfAccessToIridis5 PriorityJobFactor=10 PreemptMode=off
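
For reference, here is a minimal sketch of how a fragment like the one above could slot into a complete job_submit.lua, assuming the standard Slurm Lua plugin entry points (slurm_job_submit and slurm_job_modify). The field names and limits are taken from the snippet in this thread; the partition-routing logic mentioned earlier is omitted:

    -- job_submit.lua (sketch): reject jobs whose memory request exceeds usable node memory
    function slurm_job_submit(job_desc, part_list, submit_uid)

       -- Check memory/node is valid
       if job_desc.min_mem_per_cpu == 9223372036854775808 then
          -- --mem-per-cpu was not set; fall back to the partition default of 4300 MB
          job_desc.min_mem_per_cpu = 4300
       end

       local memory = job_desc.min_mem_per_cpu * job_desc.min_cpus

       if memory > 172000 then
          slurm.log_user("You cannot request more than 172000 Mbytes per node")
          slurm.log_user("memory is: %u", memory)
          return slurm.ERROR
       end

       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
       return slurm.SUCCESS
    end

Note that this only covers jobs that request memory per CPU; depending on the Slurm version, a per-node request (--mem) may arrive in a different job_desc field (for example min_mem_per_node), so that case may need a check of its own.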