Mike,

You didn't include your entire sbatch script, so it's hard to say what's going wrong with only a single line to work with. Based on what you have told us, my guess is that you're specifying a per-node memory requirement greater than 128000. When you specify a nodelist, Slurm will assign your job to all of those nodes, not to a subset that matches your other job specifications (--mem, --mem-per-cpu, --ntasks, etc.):

*-w*, *--nodelist*=<node name list>
    Request a specific list of hosts. The job will contain *all* of
    these hosts and possibly additional hosts as needed to satisfy
    resource requirements.
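
For example, here is a minimal sketch (the job name, task count, and memory values are made up, not from your script) that stays within the 128000 of your smaller nodes; bumping --mem above that would make the job impossible to satisfy on every node in the list:

    #!/bin/bash
    # Hypothetical example: with --nodelist, every listed node must be able
    # to satisfy the per-node memory request, so --mem has to fit the
    # smallest node in the list (RealMemory=128000 on compute[001-006]).
    #SBATCH --job-name=nodelist-test
    #SBATCH --nodelist=compute[001-006]
    #SBATCH --ntasks=16
    #SBATCH --mem=120000    # fits the 128000 nodes; 384000 would not
    srun hostname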

Prentice

On 6/7/21 7:46 PM, Yap, Mike wrote:

Hi All

Can anyone advise on the possible causes of the error message below when submitting a job?

*sbatch: error: memory allocation failure*

The same script used to work perfectly fine until I included *#SBATCH --nodelist=(compute[015-046])* (once it is removed, the script works as it should).

The issues

 1. For the current setup, I have specific resources available for
    each compute node:
     a. (NodeName=compute[007-014] Procs=36 CoresPerSocket=18
        RealMemory=384000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
        – newer model
     b. (NodeName=compute[001-006] Procs=16 CoresPerSocket=18
        RealMemory=128000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
 2. The same resources are shared between multiple queues (working fine)
 3. When running a parallel job, the exact same job runs when it is
    assigned to a single node category (i.e. exclusively on 1a or 1b)
 4. When the exact same job is assigned across both 1a and 1b, it runs
    on the 1b nodes but there is no activity on 1a

Any suggestions?

Thanks

Mike
