Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a better method), to notify the sysadmin when a Slurm node is down and/or not firing off jobs...

For example, using ‘squeue’ in NODELIST(REASON) I recently saw:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
trom      1      short*     draining  112   2:56:2  204800  0         1       (null)    Kill task failed
trom      1      long       draining  112   2:56:2  204800  0         1       (null)    Kill task failed

I’m not sure what would be the best value to grep for, as I suspect there are other states than DOWN or DRAINED that might mean a node is down and not firing off jobs?

Thanks in advance for your ideas,
Doug
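
To make the "what to grep for" question concrete, here is a minimal sketch of a cron-driven check, offered as a starting point rather than a definitive answer. It assumes `sinfo -N -h -o '%N %T %E'` is used to print one "node state reason" line per node and partition, and that a `mail` command is installed. The function names, the choice of state patterns (anything containing down, drain or fail, plus a trailing `*`, which sinfo uses to flag a non-responding node), and the address admin@example.com are illustrative assumptions, not a complete list of problem states.

```shell
#!/bin/sh
# Sketch only: function names, the state patterns, and the mail address
# below are assumptions made for illustration.

# Read "NODE STATE REASON..." lines on stdin and print one line per node
# whose state suggests it is not accepting new jobs.
filter_unhealthy() {
    awk '
        # Normalise the state column for every input line.
        { state = tolower($2) }
        # Match down, drained, draining, fail, failing, or a trailing "*"
        # (node not responding). Report each node once, even if it is
        # listed under several partitions.
        (state ~ /down|drain|fail/ || state ~ /\*$/) && !seen[$1]++ { print }
    '
}

# Query Slurm and send mail only when something needs attention.
notify_if_unhealthy() {
    report=$(sinfo -N -h -o '%N %T %E' | filter_unhealthy)
    if [ -n "$report" ]; then
        # admin@example.com is a placeholder address.
        printf '%s\n' "$report" | mail -s "Slurm nodes need attention" admin@example.com
    fi
}

# Only contact Slurm when explicitly asked to, e.g. from cron.
if [ "${1:-}" = "--run" ]; then
    notify_if_unhealthy
fi
```

Saved under a hypothetical path such as /usr/local/sbin/slurm_node_check.sh, it could be run from cron with a line like `*/10 * * * * /usr/local/sbin/slurm_node_check.sh --run`. For a quick manual look, `sinfo -R` also lists the reasons recorded for nodes in the down, drained, fail or failing states.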