You may try this workaround:

  scontrol update NodeName=<node name> State=IDLE
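For example (a sketch only, assuming the drained node is str957-mtx-11 as shown in the logs below; substitute your actual node name):

  sinfo -R                                   # list drained nodes and their drain reasons
  scontrol show node str957-mtx-11           # confirm State=DRAINING and Reason=Kill task failed
  scontrol update NodeName=str957-mtx-11 State=IDLE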
Thanks & Regards, Sudeep Narayan Banerjee System Analyst | Scientist B Information System and Technology Facility Indian Institute of Technology Gandhinagar Palaj, Gujarat 382355, INDIA On Wed, Oct 28, 2020 at 5:41 PM Diego Zuccato <diego.zucc...@unibo.it> wrote: > Hello all. > > I've found that sometimes, some jobs leave the nodes in DRAINING state. > > In slurmctld.log I find: > -8<-- > [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to: > Kill task failed > [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to > DRAINING > -8<-- > while on the node (slurmd.log): > -8<-- > [2020-10-28T11:24:11.980] [8975.0] task/cgroup: > /slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB > mem.limit=117600MB memsw.limit=117600MB > [2020-10-28T11:24:11.980] [8975.0] task/cgroup: > /slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB > mem.limit=117600MB memsw.limit=117600MB > [2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in > job are currently core dumping > [2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD > TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING > WITH SIGNALS *** > [2020-10-28T11:30:19.306] [8975.0] done with job > -8<-- > > Seems slurmd takes a bit too much time to close the job. Is there some > timeout I could change to avoid having to fix it manually? > > TIA. > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > >