Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a
better method), that alerts the sysadmin when a Slurm node is down and/or not
starting jobs.

For example, in the NODELIST(REASON) column of ‘squeue’ I recently saw:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher 
priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
trom         1    short*    draining 112    2:56:2 204800        0      1   (null) Kill task failed
trom         1      long    draining 112    2:56:2 204800        0      1   (null) Kill task failed

I’m not sure which value would be best to grep for, as I suspect there are
states other than DOWN or DRAINED (such as the "draining" state above) that
mean a node is not starting jobs.
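
To make the idea concrete, here is a minimal sketch of the kind of cron check
this could be, written in Python. It is only an illustration under stated
assumptions: the function name find_problem_nodes and the PROBLEM_STATES set
are made up for this example, the list of states that count as "not starting
jobs" is a guess that would need adjusting per site, and the output is assumed
to come from sinfo -N -h -o "%N|%T|%E" (node name, extended state, reason).

```python
import subprocess
import sys

# Assumption: these are the states treated as "node is not starting new jobs".
# This list is a guess for illustration; adjust it for your own cluster.
PROBLEM_STATES = {"down", "drained", "draining", "fail", "failing", "unknown"}


def find_problem_nodes(sinfo_output):
    """Return (node, state, reason) tuples for nodes in a problem state.

    Expects one "node|state|reason" record per line. A node listed more
    than once (for example, once per partition) is reported only once.
    """
    problems = []
    seen = set()
    for line in sinfo_output.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        node, state, reason = (p.strip() for p in parts)
        # Drop flag characters appended to the state name (e.g. "down*").
        base_state = "".join(ch for ch in state.lower() if ch.isalpha())
        # A trailing "*" marks a node that is not responding.
        not_responding = "*" in state
        if (base_state in PROBLEM_STATES or not_responding) and node not in seen:
            seen.add(node)
            problems.append((node, state, reason))
    return problems


def main():
    # -N: one line per node, -h: no header, -o: custom output format.
    output = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N|%T|%E"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = find_problem_nodes(output)
    # Print only when something is wrong, so a healthy cluster is silent.
    for node, state, reason in problems:
        print(f"{node}: {state} ({reason})")
    return 1 if problems else 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the script prints nothing when every node looks healthy, running it
from cron with MAILTO set to the sysadmin's address should be enough on most
cron implementations, since cron mails any output a job produces.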

Thanks in advance for your ideas,

Doug

