Re: [slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Brian Andrus
Xaver, You want to use the ResumeFailProgram script. We run a fully cloud-based cluster, and that is where we deal with things like this. It gets called if your ResumeProgram does not result in slurmd being available on the node in a timely manner (whatever the reason). Writing it yourself mak
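
For illustration, here is a minimal sketch of what such a resume-failure hook could look like (the slurm.conf parameter is spelled `ResumeFailProgram`). Slurm passes the failed nodes to the program as a single hostlist expression. The recovery policy shown here, clearing the DOWN state with `State=RESUME` so the node can be tried again, is an assumption and not something stated in the thread. The function names are hypothetical, and any cloud-side cleanup is left as a placeholder comment.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeFailProgram handler (hypothetical recovery policy)."""
import subprocess
import sys


def expand_hostlist(hostlist, run=subprocess.run):
    """Expand a Slurm hostlist expression such as 'node[1-3]' into node names."""
    result = run(
        ["scontrol", "show", "hostnames", hostlist],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def recovery_commands(nodes):
    """Build one scontrol call per node that clears its DOWN state.

    Assumption: returning the node to service is the desired policy.
    """
    return [["scontrol", "update", f"NodeName={n}", "State=RESUME"] for n in nodes]


def main(hostlist):
    nodes = expand_hostlist(hostlist)
    # Site-specific cleanup (for example deleting a half-created cloud
    # instance) would go here; it is deliberately left out of this sketch.
    for cmd in recovery_commands(nodes):
        subprocess.run(cmd, check=False)
    return 0


# Slurm calls the program with one hostlist argument; do nothing without it.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

To try it, one would point `ResumeFailProgram` in slurm.conf at the script's absolute path and make it executable for the Slurm user. Whether resuming, draining, or leaving the node DOWN is the right reaction depends on why the power-up failed.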

[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Xaver Stiensmeier
Hello slurm-users, A similar version of this question is posted here: https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system Issue: Current behavior and problem description. When a node fails to POWER_UP, it is marked DOWN.