👍 Best,
Feng

On Wed, Sep 20, 2023 at 7:29 AM Wagner, Marcus <wag...@itc.rwth-aachen.de> wrote:
> Even after rebooting, nodes are sometimes stuck because of
> "completing jobs".
>
> What helps then is to set the node down and resume it afterwards:
>
> scontrol update nodename=<nodename> state=down reason=stuck; scontrol update nodename=<nodename> state=resume
>
> Best
> Marcus
>
> On 20.09.2023 at 09:11, Ole Holm Nielsen wrote:
> > On 9/20/23 01:39, Feng Zhang wrote:
> >> Restarting the slurmd daemon of the compute node should work, if the
> >> node is still online and normal.
> >
> > Probably not. If the filesystem used by the job is hung, the node
> > must probably be rebooted, and the filesystem must be checked.
> >
> > /Ole
> >
> >> On Tue, Sep 19, 2023 at 8:03 AM Felix <fe...@itim-cj.ro> wrote:
> >>>
> >>> Hello
> >>>
> >>> I have a job on my system which has been running longer than its
> >>> time limit, more than 4 days.
> >>>
> >>> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
> >>>
> >>> I'm trying to cancel it
> >>>
> >>> [@arc7-node ~]# scancel 1808851
> >>>
> >>> I get no message, as if the job was canceled, but when getting
> >>> information about the job, the job is still there
> >>>
> >>> [@arc7-node ~]# squeue | grep awn-047
> >>> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
> >>>
> >>> Can I do anything else to end the job?
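
For reference, the "set the node down and resume it" suggestion in this thread could be wrapped in a small shell function. This is only a sketch: the function name `reset_stuck_node` and the `SCONTROL` override variable are made up for illustration and are not part of Slurm. Setting `SCONTROL=echo` prints the arguments instead of calling `scontrol`, which allows a dry run first.

```shell
#!/usr/bin/env bash
# Sketch only: clear a node whose job is stuck in the CG (completing) state
# by marking the node down and then resuming it, as suggested in the thread.
#
# Assumptions (not from the thread): the name reset_stuck_node and the
# SCONTROL variable are hypothetical helpers. With SCONTROL=echo the
# function prints the scontrol arguments instead of executing them.
reset_stuck_node() {
    local node="${1:-}"
    local reason="${2:-stuck}"

    if [ -z "$node" ]; then
        echo "usage: reset_stuck_node <nodename> [reason]" >&2
        return 1
    fi

    # Step 1: mark the node down; a reason is given along with the state change.
    "${SCONTROL:-scontrol}" update nodename="$node" state=down reason="$reason" || return 1

    # Step 2: resume the node so it can accept jobs again.
    "${SCONTROL:-scontrol}" update nodename="$node" state=resume
}
```

For the node in the original question this would be `reset_stuck_node awn-047`, run as a Slurm administrator, followed by `squeue -w awn-047` to check whether the CG job is gone. Note that setting a node down terminates any other jobs still running on it, and if the job's filesystem is hung the node may still need a reboot first, as Ole pointed out.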