Hello all. I've noticed that some jobs occasionally leave their nodes in the DRAINING state.
In slurmctld.log I find:

-8<--
[2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to: Kill task failed
[2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to DRAINING
-8<--

while on the node (slurmd.log):

-8<--
[2020-10-28T11:24:11.980] [8975.0] task/cgroup: /slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB mem.limit=117600MB memsw.limit=117600MB
[2020-10-28T11:24:11.980] [8975.0] task/cgroup: /slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB mem.limit=117600MB memsw.limit=117600MB
[2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in job are currently core dumping
[2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-10-28T11:30:19.306] [8975.0] done with job
-8<--

It seems slurmd takes a bit too long to close out the job (the processes were still core dumping), so the controller drains the node. Is there some timeout I could change to avoid having to fix it manually?

TIA.
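In case it's useful context: right now I recover the node by hand from the controller with something like

  scontrol update NodeName=str957-mtx-11 State=RESUME

(node name taken from the logs above). My unverified guess is that the relevant knob is UnkillableStepTimeout in slurm.conf, along the lines of

  # Give slurmd more time to reap tasks that are still core dumping
  # before it reports "Kill task failed" (default is 60 seconds;
  # 180 here is only an example value, not a recommendation)
  UnkillableStepTimeout=180

but I'd rather hear from someone who has tuned this before changing it.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786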