Hi all,

Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment.

I can see that TaskProlog and TaskEpilog would let us run our detection test; however, unlike Prolog and Epilog, they do not drain the node when they exit with a non-zero exit code.

Does anyone have advice on automatically draining a node in this situation, please?

Best wishes,

Mark
