Xavier,
You want to use the ResumeFailProgram script.
We run a fully cloud-based cluster, and that is where we deal with things like
this. It gets called if your ResumeProgram does not result in slurmd
being available on the node within ResumeTimeout (whatever the reason).
Writing it yourself mak
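
To make the idea concrete, here is a minimal sketch of what such a script could look like. It is not our production script. It assumes slurm.conf points ResumeFailProgram at it, that the failed nodes arrive as a single hostlist argument, and that asking Slurm to power the node down again is the recovery you want (some sites may prefer State=RESUME or extra cloud cleanup). The function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeFailProgram handler; recovery policy is an assumption."""
import subprocess
import sys


def expand_hostlist(hostlist):
    """Expand a hostlist expression such as 'cloud[1-3]' into node names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def recovery_commands(nodes, state="POWER_DOWN"):
    """Build one 'scontrol update' command per failed node.

    Assumption: resetting the node with State=POWER_DOWN is the desired
    recovery. Pass a different state if your site handles this differently.
    """
    return [
        ["scontrol", "update", f"NodeName={node}", f"State={state}"]
        for node in nodes
    ]


def handle_failed_nodes(hostlist):
    """Expand the hostlist and run the recovery command for every node."""
    for cmd in recovery_commands(expand_hostlist(hostlist)):
        # Hypothetical hook: delete the half-created cloud instance here.
        subprocess.run(cmd, check=False)


if __name__ == "__main__" and len(sys.argv) == 2:
    # The failed nodes are passed as a single hostlist argument.
    handle_failed_nodes(sys.argv[1])
```

Keeping the command construction in its own function makes it easy to check what would be run before letting the script touch a live cluster.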
Hello slurm-users,
A similar version of this question is posted here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system
Issue
Current behavior and problem description
When a node fails to POWER_UP, it is marked DOWN.