Hi Doug,

Slurm has the strigger [1] mechanism, which can do exactly that; the man page even has your use case as an example. It works quite well for us.
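As a rough sketch of how this could look (not our exact setup): the script below registers permanent triggers for the "down" and "drained" node events and mails the admin when one fires. The script path, the mail address, and the function names are placeholders I made up for illustration. It assumes a working `mail` command on the slurmctld host, and it relies on Slurm passing the affected node names to the trigger program as arguments.

```shell
#!/bin/bash
# Sketch of a node-event notifier for strigger.
# NOTIFY_SCRIPT and ADMIN_MAIL are hypothetical placeholders; adjust to your site.
NOTIFY_SCRIPT="${NOTIFY_SCRIPT:-/usr/local/sbin/slurm_node_notify}"
ADMIN_MAIL="${ADMIN_MAIL:-root@localhost}"

# Build the mail subject from the node names passed in by slurmctld.
build_subject() {
    echo "Slurm node event: $*"
}

# Register one trigger per event type. --flags=PERM keeps the trigger
# in place after it fires, so it does not have to be re-registered.
register_triggers() {
    strigger --set --node --down    --flags=PERM --program="$NOTIFY_SCRIPT"
    strigger --set --node --drained --flags=PERM --program="$NOTIFY_SCRIPT"
}

case "${1:-}" in
    "")        ;;                      # no arguments: nothing to do
    --install) register_triggers ;;    # one-time setup
    *)         # called by slurmctld with the affected node names
               echo "Check with: sinfo -R" \
                   | mail -s "$(build_subject "$@")" "$ADMIN_MAIL" ;;
esac
```

Run it once with `--install` to register the triggers, then check them with `strigger --get`. As far as I know, triggers have to be set by the SlurmUser account, so run the install step as that user.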

Best,
Marcus

[1] https://slurm.schedmd.com/strigger.html

On 26.06.21 19:10, Doug Niven wrote:
> Hi Folks,
>
> I’d like to set up an email notification, perhaps via cron (unless there’s a better method), to notify the sysadmin when a Slurm node is down and/or not firing off jobs...
>
> For example, using ‘squeue’ in NODELIST(REASON) I recently saw:
>
>   (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
>
> And using ‘sinfo’ I saw:
>
>   % sinfo -Nl
>   Fri May 07 08:49:26 2021
>   NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
>   trom      1      short*     draining  112   2:56:2  204800  0         1       (null)    Kill task failed
>   trom      1      long       draining  112   2:56:2  204800  0         1       (null)    Kill task failed
>
> I’m not sure what would be the best value to grep for, as I suspect there are other states than DOWN or DRAINED that might mean a node is down and not firing off jobs?
>
> Thanks in advance for your ideas,
>
> Doug

--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.: +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de
Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Sekretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de
Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598
Zertifiziert nach ISO 9001
-------------------------------------------------------------------------