Matt, I saw a similar situation with a PBS job recently. A process that is writing to disk cannot be killed while it is in uninterruptible sleep (D state). In that case the job ended, but PBS logged that it could not kill the process. I would look in detail at the Slurm logs at the point where that job is being killed; you might get some useful information there. I guess this depends on the method Slurm uses to kill a job.
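For what it's worth, a quick way to check whether it is the same thing is to look at the state of the leftover process on the node. A minimal sketch (the PID is whatever ps or nvidia-smi reports for the surviving process, and the slurmd log path depends on SlurmdLogFile in your slurm.conf):

    # State of the surviving process: D = uninterruptible sleep (unkillable
    # until the I/O completes), S = interruptible sleep
    ps -o pid,stat,wchan,cmd -p <PID>

    # Which processes still hold GPU memory
    nvidia-smi

    # What slurmd logged on that node while job 12345 was being torn down
    grep 12345 /var/log/slurmd.log

As far as I know, on scancel the job first gets SIGTERM and then, after KillWait seconds (30 by default), SIGKILL, so anything still running afterwards was either stuck in D state (signals are not delivered until the process leaves uninterruptible sleep) or had escaped Slurm's process tracking.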
Of course this could be a completely different scenario.

On 22 November 2017 at 14:55, Matt McKinnon <m...@techsquare.com> wrote:
> Hi All,
>
> I'm wondering if you've seen this issue around, I can't seem to find
> anything on it:
>
> We have an NVIDIA DGX-1 that we run SLURM on in order to queue up jobs on
> the GPU's there, but we're running into an issue:
>
> 1) launch a SLURM job (assume job id = 12345)
>
> 2) start a program that runs on GPUs and writes continuously to disk
> (e.g., to ~/test.txt)
>
> 3) kill the process in another terminal with the command scancel 12345
>
> You would see that although the job 12345 has been killed, the file
> ~/test.txt is still being written to, and that the GPU memory taken up by
> job 12345 is still not released.
>
> Have you seen anything like this? Trying to figure out if it's a SLURM
> issue, or a GPU issue. We're running SLURM 17.02.7.
>
> -Matt
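In case it helps to narrow things down, a minimal job along the lines of the steps above might look like this (a sketch only; gpu_writer stands in for any CUDA program that keeps appending to ~/test.txt, and --gres=gpu:1 assumes GRES is configured for the DGX GPUs):

    #!/bin/bash
    #SBATCH --job-name=gpu-kill-test
    #SBATCH --gres=gpu:1
    #SBATCH --time=00:10:00

    # Stand-in for the real workload: a GPU program that writes continuously
    ./gpu_writer >> ~/test.txt

Submit it with sbatch, note the job id, run scancel on it from another terminal, and then watch tail -f ~/test.txt and nvidia-smi to see whether the writer and its GPU memory actually go away.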