Hi Doug,

Slurm has the strigger[1] mechanism, which can do exactly that; the man page even has your use case as an example. It works quite well for us.
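
A minimal sketch of such a setup, modelled on the pattern in the strigger man page. The script path, the mail address and the helper names (build_subject, rearm_triggers) are placeholders for illustration, not something Slurm provides, and whether --drained also catches a node that is still "draining" should be checked on your own cluster:

```shell
#!/bin/bash
# Hypothetical notification script; strigger runs it with the affected
# node names as arguments when the event occurs.

# Assumptions: replace both values with your real address and install path.
ADMIN_EMAIL="${ADMIN_EMAIL:-root@localhost}"
SELF="${SELF:-/usr/local/sbin/slurm_node_notify}"

# Build the mail subject from the node names handed over by strigger.
build_subject() {
    echo "Slurm nodes down/drained: $*"
}

# Triggers are purged after they fire, so register them again each time.
# (--flags=PERM on the initial registration is an alternative to re-arming.)
rearm_triggers() {
    strigger --set --node --down    --program="$SELF"
    strigger --set --node --drained --program="$SELF"
}

# Only act when the Slurm tools exist and node names were passed in.
if command -v strigger >/dev/null 2>&1 && [ "$#" -gt 0 ]; then
    rearm_triggers
    echo "Affected nodes: $*" | mail -s "$(build_subject "$@")" "$ADMIN_EMAIL"
fi
```

The triggers have to be registered once by hand before the first event (the two strigger --set lines above), normally as the SlurmUser account; strigger --get lists what is currently registered.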

Best,
Marcus

[1] https://slurm.schedmd.com/strigger.html

On 26.06.21 19:10, Doug Niven wrote:
Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a 
better method), to alert the sysadmin when a Slurm node is down and/or not 
firing off jobs...

For example, using ‘squeue’ in NODELIST(REASON) I recently saw:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
trom           1    short*    draining  112   2:56:2 204800        0      1   (null) Kill task failed
trom           1      long    draining  112   2:56:2 204800        0      1   (null) Kill task failed

I’m not sure what the best value to grep for would be, since I suspect that 
states other than DOWN or DRAINED can also mean a node is down and not firing 
off jobs.

Thanks in advance for your ideas,

Doug



--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de

Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Sekretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-------------------------------------------------------------------------
