Matt, I saw a similar situation with a PBS job recently. A process that is writing to disk cannot be killed while it is in uninterruptible sleep (D state). In that case the job ended, but PBS logged that it could not kill the process. I would look in detail at the Slurm logs at the point where that job is being killed; you might get some useful information there. I guess this depends on the method Slurm uses to kill a job.
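For what it's worth, a quick way to check whether it is the same thing is to look at the state of the leftover process on the node. A minimal sketch (the PID is whatever ps or nvidia-smi reports for the surviving process, and the slurmd log path depends on SlurmdLogFile in your slurm.conf):

    # State of the surviving process: D = uninterruptible sleep (unkillable
    # until the I/O completes), S = interruptible sleep
    ps -o pid,stat,wchan,cmd -p <PID>

    # Which processes still hold GPU memory
    nvidia-smi

    # What slurmd logged on that node while job 12345 was being torn down
    grep 12345 /var/log/slurmd.log

As far as I know, on scancel the job first gets SIGTERM and then, after KillWait seconds (30 by default), SIGKILL, so anything still running afterwards was either stuck in D state (signals are not delivered until the process leaves uninterruptible sleep) or had escaped Slurm's process tracking.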
Of course this could be a completely different scenario.

On 22 November 2017 at 14:55, Matt McKinnon <m...@techsquare.com> wrote:
> Hi All,
>
> I'm wondering if you've seen this issue around, I can't seem to find
> anything on it:
>
> We have an NVIDIA DGX-1 that we run SLURM on in order to queue up jobs on
> the GPU's there, but we're running into an issue:
>
> 1) launch a SLURM job (assume job id = 12345)
>
> 2) start a program that runs on GPUs and writes continuously to disk
> (e.g., to ~/test.txt)
>
> 3) kill the process in another terminal with the command scancel 12345
>
> You would see that although the job 12345 has been killed, the file
> ~/test.txt is still being written to, and that the GPU memory taken up by
> job 12345 is still not released.
>
> Have you seen anything like this? Trying to figure out if it's a SLURM
> issue, or a GPU issue. We're running SLURM 17.02.7.
>
> -Matt
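In case it helps to narrow things down, a minimal job along the lines of the steps above might look like this (a sketch only; gpu_writer stands in for any CUDA program that keeps appending to ~/test.txt, and --gres=gpu:1 assumes GRES is configured for the DGX GPUs):

    #!/bin/bash
    #SBATCH --job-name=gpu-kill-test
    #SBATCH --gres=gpu:1
    #SBATCH --time=00:10:00

    # Stand-in for the real workload: a GPU program that writes continuously
    ./gpu_writer >> ~/test.txt

Submit it with sbatch, note the job id, run scancel on it from another terminal, and then watch tail -f ~/test.txt and nvidia-smi to see whether the writer and its GPU memory actually go away.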