Mike,

You didn't include your entire sbatch script, so it's hard to say what's going wrong with only a single line to work with. Based on what you have told us, my guess is that you're specifying a per-node memory requirement greater than 128000. When you specify a nodelist, Slurm will assign your job to all of those nodes, not to a subset that matches your other job specifications (--mem, --mem-per-cpu, --ntasks, etc.):

*-w*, *--nodelist*=<node name list>
    Request a specific list of hosts. The job will contain *all* of
    these hosts and possibly additional hosts as needed to satisfy
    resource requirements.
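
For example, here is a minimal sketch (the job name, task count, and memory values are made up, not from your script) that stays within the 128000 of your smaller nodes; bumping --mem above that would make the job impossible to satisfy on every node in the list:

    #!/bin/bash
    # Hypothetical example: with --nodelist, every listed node must be able
    # to satisfy the per-node memory request, so --mem has to fit the
    # smallest node in the list (RealMemory=128000 on compute[001-006]).
    #SBATCH --job-name=nodelist-test
    #SBATCH --nodelist=compute[001-006]
    #SBATCH --ntasks=16
    #SBATCH --mem=120000    # fits the 128000 nodes; 384000 would not
    srun hostname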

Prentice

On 6/7/21 7:46 PM, Yap, Mike wrote:

Hi All

Can anyone advise on the possible causes of the error message below when submitting a job?

*sbatch: error: memory allocation failure*

The same script used to work perfectly fine until I included *#SBATCH --nodelist=(compute[015-046])* (once it is removed, the script works as it should).

The issues

 1. For the current setup, I have specific resources available for
    each compute node:
     a. (NodeName=compute[007-014] Procs=36 CoresPerSocket=18
        RealMemory=384000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
        – newer model
     b. (NodeName=compute[001-006] Procs=16 CoresPerSocket=18
        RealMemory=128000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
 2. The same resources are shared between multiple queues (working fine)
 3. When running a parallel job, the exact same job runs when it is
    assigned to a single node category (i.e. exclusively on 1a or 1b)
 4. When the exact same job is assigned across both 1a and 1b, it runs
    on the 1b nodes but there is no activity on 1a

Any suggestions?

Thanks

Mike
