Hi All,

I'm wondering if you've seen this issue before; I can't seem to find anything on it:

We have an NVIDIA DGX-1 that we run SLURM on to queue up jobs on its GPUs, but we're running into an issue:

    1) Launch a SLURM job (assume job id = 12345).

    2) Start a program in that job that runs on the GPUs and writes continuously to disk (e.g., to ~/test.txt).

    3) Kill the job from another terminal with scancel 12345.

What we see is that although job 12345 has been killed, ~/test.txt is still being written to, and the GPU memory used by the job has not been released.
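For concreteness, here is roughly how we observe it (run_gpu_job.sh and the --gres flag below are just placeholders for illustration; our real job is a larger GPU program):

    # submit the job; suppose SLURM reports job id 12345
    $ sbatch --gres=gpu:1 run_gpu_job.sh

    # in another terminal, watch the file the job writes to
    $ tail -f ~/test.txt

    # cancel the job
    $ scancel 12345

    # the job disappears from the queue...
    $ squeue -j 12345

    # ...but ~/test.txt keeps growing, and nvidia-smi still shows the
    # job's process holding GPU memory
    $ tail -f ~/test.txt
    $ nvidia-smi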

Have you seen anything like this? I'm trying to figure out whether it's a SLURM issue or a GPU issue. We're running SLURM 17.02.7.

-Matt
