Hi all,

Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment.
I can see that TaskProlog / TaskEpilog allow us to run our detection test; however, unlike Epilog and Prolog, they do not drain a node if they exit with a non-zero exit code.
Does anyone have advice on automatically draining a node in this situation, please?

Best wishes,
Mark