Hi,

Can someone help me understand what this error is?

select/cons_res: node cn95 memory is under-allocated (125000-135000) for JobId=23544043

We get a lot of these from time to time, and I don't understand what they are about.

Looking at the code, it doesn't make sense for this to be happening on running
jobs.

plugins/select/cons_res/select_cons_res.c

/*
 * deallocate resources previously allocated to the given job
 * - subtract 'struct job_resources' resources from 'struct part_res_record'
 * - subtract job's memory requirements from 'struct node_res_record'
 *
 * if action = 0 then subtract cores, memory + GRES (running job was terminated)
 * if action = 1 then subtract memory + GRES (suspended job was terminated)
 * if action = 2 then only subtract cores (job is suspended)
 */
static int _rm_job_from_res(struct part_res_record *part_record_ptr,
                            struct node_use_record *node_usage,
                            struct job_record *job_ptr, int action)

...
if (action != 2) {
        if (node_usage[i].alloc_memory <
            job->memory_allocated[n]) {
                error("%s: node %s memory is under-allocated (%"PRIu64"-%"PRIu64") for %pJ",
                      plugin_type, node_ptr->name,
                      node_usage[i].alloc_memory,
                      job->memory_allocated[n],
                      job_ptr);
                node_usage[i].alloc_memory = 0;
        } else
                node_usage[i].alloc_memory -=
                        job->memory_allocated[n];
}
...
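
For what it's worth, here is how I read that branch, as a minimal standalone
sketch. The struct and function names (node_usage_sketch, release_job_memory)
are my own stand-ins, not Slurm identifiers, and the sketch only models the
memory subtraction, not cores or GRES.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-node bookkeeping; not a Slurm type. */
struct node_usage_sketch {
        uint64_t alloc_memory;  /* memory the controller has recorded as allocated */
};

/*
 * Subtract a job's memory from the node's running total.
 * Returns 1 if the total was smaller than the job's share (the
 * "under-allocated" case) and was reset to 0, otherwise returns 0.
 */
int release_job_memory(struct node_usage_sketch *node, uint64_t job_memory)
{
        assert(node != NULL);

        if (node->alloc_memory < job_memory) {
                /* Subtracting would wrap the unsigned counter, so log and clamp. */
                fprintf(stderr,
                        "node memory is under-allocated (%" PRIu64 "-%" PRIu64 ")\n",
                        node->alloc_memory, job_memory);
                node->alloc_memory = 0;
                return 1;
        }

        node->alloc_memory -= job_memory;
        return 0;
}
```

With the values from our log, a node total of 125000 and a job share of 135000
takes the error branch and resets the total to 0. So, as far as I can tell, the
message is about the controller's bookkeeping for the node being smaller than
what the job record says it holds, not about how much memory the job really
uses.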

It appears to me that this function should only be called when a job has
ended or been suspended. Yet these errors are being printed for running jobs.
Is Slurm actually deallocating resources for that job, leaving more memory
available for other jobs? I don't think that is the case.

Does anyone have any thoughts here?

My initial feeling is: who cares if the node is under-allocated? Yes, it
would be great if users actually came close to using the memory/resources
they asked for so that nothing is wasted, but this typically doesn't happen.
Is this error there to let sysadmins know that they should perhaps
overprovision the memory? Or maybe there is a config issue on our side? I
don't think the latter is the case.

Thanks!

Best,
Chris


—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
