Hi,

Cancel the job with scancel, then set the affected nodes to the "down" state 
with "scontrol update nodename=<nodename> state=down reason=cg" and resume 
them afterwards. However, if tasks are stuck on a node, in most cases a 
reboot is needed to bring the node back in a clean state.
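
As a rough sketch, the sequence above could be wrapped in a small bash helper 
like the one below. The function names, the DRY_RUN switch, and the example 
job ID and node name are my own placeholders, not anything Slurm provides. By 
default it only prints the commands so you can check them first.

```shell
#!/usr/bin/env bash
# Sketch only: cancel a job stuck in CG, then cycle its node down and back up.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 = job ID, $2 = node name.
clear_cg_node() {
    local jobid="$1" node="$2"
    run scancel "$jobid"
    run scontrol update nodename="$node" state=down reason=cg
    run scontrol update nodename="$node" state=resume
}

# Example (placeholder values): clear_cg_node 12345 node01
```

"squeue -t CG" lists the jobs still in the completing state, which gives you 
the job IDs and nodes to pass in. If processes are really hung on I/O, this 
will not help and the node still needs a reboot.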

Best,
Florian
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Durai 
Arasan <arasan.du...@gmail.com>
Sent: Friday, 20 August 2021 10:31
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External][slurm-users] jobs stuck in "CG" state

Hello!

We have a huge number of jobs stuck in CG state from a user who probably wrote 
code with bad I/O. "scancel" does not make them go away. Is there a way for 
admins to get rid of these jobs without draining and rebooting the nodes? I 
read somewhere that killing the respective slurmstepd process will do the job. 
Is this possible? Are there any other solutions? Also, are there any 
parameters in slurm.conf one can set to manage such situations better?

Best,
Durai
MPI Tübingen
