Re: [slurm-users] help with canceling or deleteing a job

Feng Zhang Wed, 20 Sep 2023 08:33:29 -0700

👍

Best,


Feng


On Wed, Sep 20, 2023 at 7:29 AM Wagner, Marcus <wag...@itc.rwth-aachen.de>
wrote:

> Even after rebooting, sometimes nodes are stuck because of "completing
> jobs".
>
> What helps then is to set the node down and resume it afterwards:
>
> scontrol update nodename=<nodename> state=drain reason=stuck
> scontrol update nodename=<nodename> state=resume
>
>
> Best
> Marcus
>
> Am 20.09.2023 um 09:11 schrieb Ole Holm Nielsen:
> > On 9/20/23 01:39, Feng Zhang wrote:
> >> Restarting the slurmd daemon on the compute node should work, if the
> >> node is still online and normal.
> >
> > Probably not.  If the filesystem used by the job is hung, the node
> > must probably be rebooted, and the filesystem must be checked.
> >
> > /Ole
> >
> >> On Tue, Sep 19, 2023 at 8:03 AM Felix <fe...@itim-cj.ro> wrote:
> >>>
> >>> Hello
> >>>
> >>> I have a job on my system that has been running longer than its
> >>> allotted time, for more than 4 days.
> >>>
> >>> 1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047
> >>>
> >>> I'm trying to cancel it
> >>>
> >>> [@arc7-node ~]# scancel 1808851
> >>>
> >>> I get no message, as if the job had been canceled, but when I query
> >>> the job, it is still there
> >>>
> >>> [@arc7-node ~]# squeue | grep awn-047
> >>>              1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047
> >>>
> >>> Is there anything else I can do to kill the job?
> >
>
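
For reference, the drain-and-resume sequence quoted above could be wrapped in a small shell helper. This is only a sketch under assumptions: `cycle_node` and the `DRY_RUN` variable are made-up names, not part of Slurm, and by default the helper only prints the two scontrol commands instead of running them.

```shell
#!/bin/sh
# Sketch: cycle a node whose job is stuck in the CG (completing) state.
# cycle_node and DRY_RUN are hypothetical names, not Slurm features.
# DRY_RUN defaults to 1, so the commands are printed rather than executed.

cycle_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: cycle_node <nodename>" >&2
        return 1
    fi

    # In dry-run mode, prefix each command with echo so it is only shown.
    run="echo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi

    # The same two commands as quoted in the thread.
    $run scontrol update nodename="$node" state=drain reason=stuck
    $run scontrol update nodename="$node" state=resume
}
```

Calling `cycle_node awn-047` prints the two commands for review; setting `DRY_RUN=0` would actually run them, which requires admin rights on the cluster. If the job's filesystem is hung, as noted earlier in the thread, this may not be enough and a reboot may still be needed.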
