Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

Mark Dixon Mon, 24 May 2021 08:57:54 -0700

Hi Brian,

Thanks for replying. On our hardware, GPUs allocated to a job by cgroup sometimes get themselves into a state requiring a reboot.

Outside the job, a simple CUDA program calling the API function cudaGetDeviceCount works happily. Inside the job, it returns an error code of 3 (cudaErrorInitializationError).
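For illustration, a minimal sketch of such a probe, written in Python with ctypes rather than as a compiled CUDA program. The library name "libcudart.so" and the helper name cuda_device_count are assumptions for this example, not what is actually deployed here; only cudaGetDeviceCount itself comes from the CUDA runtime API.

```python
import ctypes


def cuda_device_count(runtime=None):
    """Call cudaGetDeviceCount and return (error_code, device_count).

    `runtime` is any object exposing cudaGetDeviceCount; by default the
    CUDA runtime is loaded under an assumed library name.
    """
    if runtime is None:
        # Assumption: the runtime is installed and resolvable as libcudart.so.
        runtime = ctypes.CDLL("libcudart.so")
    count = ctypes.c_int(0)
    # cudaGetDeviceCount(int *count) returns a cudaError_t; 0 means success.
    err = runtime.cudaGetDeviceCount(ctypes.pointer(count))
    return err, count.value


if __name__ == "__main__":
    err, n = cuda_device_count()
    if err != 0:
        # Inside an affected job this would report error 3.
        print(f"cudaGetDeviceCount failed with error {err}")
    else:
        print(f"{n} CUDA device(s) visible")
```

Run once outside a job and once inside it, a probe like this should show the difference described above.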

At present, I have a TaskProlog that prods this API function and emails me when there is a failure. It'd be nice if the nodes could drain themselves without administrator intervention, rather than continuing to run waiting jobs and so causing them to fail.

I can see a couple of ways to do it (e.g. sudo script in TaskProlog, or playing with the cgroup hierarchy outside of slurm), but was wondering if I had misunderstood the slurm docs and there was a simpler way.
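The sudo route might look roughly like the sketch below. It assumes a sudoers rule (not shown, and hypothetical) that lets job users run this one scontrol command without a password, and that SLURMD_NODENAME is set in the task environment; the helper names drain_command and drain_node are made up for the example.

```python
import os
import subprocess


def drain_command(node, reason):
    """Build the argv for draining `node` via scontrol under sudo."""
    # "sudo -n" fails instead of prompting if no passwordless rule matches.
    return [
        "sudo", "-n", "scontrol", "update",
        f"NodeName={node}", "State=DRAIN", f"Reason={reason}",
    ]


def drain_node(node, reason, run=subprocess.run):
    """Run the drain command and return its exit status."""
    return run(drain_command(node, reason), check=False).returncode


if __name__ == "__main__":
    # Assumption: slurmd exports the node name to the task environment.
    node = os.environ.get("SLURMD_NODENAME")
    if node:
        status = drain_node(node, "GPU initialization failed in TaskProlog")
        if status != 0:
            print(f"drain request for {node} exited with status {status}")
```

A TaskProlog would call drain_node only after the GPU check has failed, so healthy nodes are never touched.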


Best,

Mark

On Mon, 24 May 2021, Brian Andrus wrote:

Not sure I can understand how it can only be detected from inside the
job environment for a failed node.

That description is more of "our application is behaving badly, but not
so bad, the node quits responding." For that situation, your app or job
should have something that it is doing to catch that and report it to
slurm in some fashion (up to and including, kill the process).

Slurm polls the nodes and if slurmd does not respond, it will mark the
node as failed. So slurmd must be responding.

If you can provide a better description of what symptoms you see that
cause you to feel the node has failed, we can help a little more.

On 5/24/2021 3:02 AM, Mark Dixon wrote:

 Hi all,

 Sometimes our compute nodes get into a failed state which we can only
 detect from inside the job environment.

 I can see that TaskProlog / TaskEpilog allows us to run our detection
 test; however, unlike Epilog and Prolog, they do not drain a node if
 they exit with a non-zero exit code.

 Does anyone have advice on automatically draining a node in this
 situation, please?

 Best wishes,

 Mark
