Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-25 Thread Mark Dixon
Thanks to everyone for their help, much appreciated. It seems to confirm that things would be much easier if I could just figure out a way to detect the issue from the Prolog/Epilog, rather than the TaskProlog/TaskEpilog! All the best, Mark

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Christopher Samuel
On 5/24/21 3:02 am, Mark Dixon wrote: > Does anyone have advice on automatically draining a node in this situation, please? We do some health checks via a node epilog set with the "Epilog" setting, including queueing node reboots with "scontrol reboot". All the best, Chris -- Chris Samuel
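For illustration only, a minimal sketch of the kind of node Epilog Chris describes; the GPU health-check command and the reason string are placeholders I have invented, not anything from his setup:

#!/bin/bash
# Node Epilog (slurm.conf "Epilog="): runs as root on the compute node
# after each job; a non-zero exit code from this script drains the node.

# Hypothetical site-local GPU health check; substitute your own test.
if ! /usr/local/sbin/check_gpus; then
    # Queue a reboot for when the node becomes idle, then return it
    # to service automatically.
    scontrol reboot ASAP nextstate=resume \
        reason="epilog: GPU health check failed" "$SLURMD_NODENAME"
fi

exit 0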

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Brian Andrus
Ah. I'll proceed under the scenario that there is a piece of hardware being tested that may lock up (the GPU in this case). If you are able to identify that the issue is occurring from within the job, you should exit the job with an error or some signal to alert Slurm (e.g. a semaphore file).
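A rough sketch of the semaphore-file idea, with the flag directory and reason string invented for illustration: the node Epilog (which runs as root and can drain the node) looks for a flag left behind from inside the job and acts on it:

#!/bin/bash
# Node Epilog: runs as root after each job. If something inside the job
# dropped a flag file, drain the node so no further jobs land on it.

FLAG_DIR=/var/spool/slurm/gpu_fault    # example node-local location

if compgen -G "${FLAG_DIR}/*" > /dev/null; then
    scontrol update nodename="$SLURMD_NODENAME" state=drain \
        reason="GPU fault flagged from inside job ${SLURM_JOB_ID}"
    rm -f "${FLAG_DIR}"/*
fi

exit 0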

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Mark Dixon
Hi Brian, Thanks for replying. On our hardware, GPUs allocated to a job by cgroup sometimes get themselves into a state requiring a reboot. Outside the job, a simple CUDA program calling the API function cudaGetDeviceCount works happily. Inside the job, it returns an error code of 3 (cudaErrorInitializationError).
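As a sketch of how that detection might be wired up inside the job environment (the checker binary and flag path below are assumptions; the checker is imagined as a small program that simply calls cudaGetDeviceCount and exits non-zero on any error):

#!/bin/bash
# TaskProlog (slurm.conf "TaskProlog="): runs as the job user inside the
# task's cgroup, so it sees the same broken GPU state the job would.
# Its exit code does not drain the node, so we only record the failure.

FLAG_DIR=/var/spool/slurm/gpu_fault    # must be writable by job users

if ! /usr/local/bin/cuda_devcount_check; then
    touch "${FLAG_DIR}/${SLURM_JOB_ID}" 2>/dev/null
    # A line starting with "print" on TaskProlog stdout is written to
    # the job's stdout, so the user sees the warning too.
    echo "print WARNING: GPUs on $(hostname) appear unhealthy"
fi

exit 0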

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Brian Andrus
Not sure I understand how it can only be detected from inside the job environment for a failed node. That description sounds more like "our application is behaving badly, but not so badly that the node quits responding." For that situation, your app or job should have something that it is doing to catch that.

[slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Mark Dixon
Hi all, Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment. I can see that TaskProlog / TaskEpilog allow us to run our detection test; however, unlike Epilog and Prolog, they do not drain a node if they exit with a non-zero exit code.
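For reference, the four hooks are set in slurm.conf along these lines (script paths are placeholders): Prolog and Epilog run as root on the node and a non-zero exit drains it, while TaskProlog and TaskEpilog run as the job's user inside the task environment and their exit status is not used to drain:

# slurm.conf (script paths are examples only)
Prolog=/etc/slurm/prolog.sh            # root, per job; non-zero exit drains the node
Epilog=/etc/slurm/epilog.sh            # root, per job; non-zero exit drains the node
TaskProlog=/etc/slurm/task_prolog.sh   # job user, per task; exit code does not drain
TaskEpilog=/etc/slurm/task_epilog.sh   # job user, per task; exit code does not drain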