How are you taking them offline? I would expect a SuspendProgram script
that is running the command that shuts them down. Also, one of your
SlurmctldParameters should be "idle_on_node_suspend"
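For reference, a minimal sketch of the relevant slurm.conf pieces (the script paths and timings below are placeholders, not taken from this thread):

    # Power-saving hooks for cloud nodes
    SuspendProgram=/opt/slurm/bin/suspend.sh   # placeholder: script that powers off / deletes the instance
    ResumeProgram=/opt/slurm/bin/resume.sh     # placeholder: script that recreates and boots the instance
    SuspendTime=600                            # seconds idle before a node is suspended
    ResumeTimeout=900                          # seconds allowed for a resumed node to register
    SlurmctldParameters=idle_on_node_suspend

slurmctld calls both scripts with a hostlist expression as the argument, so they typically expand it with "scontrol show hostnames" and loop over the node names.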
Brian Andrus
On 4/1/2021 12:25 PM, Sajesh Singh wrote:
Brian,
Targeting the correct partition, and no QOS limits are imposed that would cause
this issue. The only way I have found to remedy it is to completely remove the cloud
nodes from Slurm, restart slurmctld, re-add the nodes, and restart slurmctld again.
I believe the issue is caused by when the nodes in the cl
For this one, you want to look closely at the job. Is it targeting a
specific partition/nodelist?
See what resources it is looking for (scontrol show job <jobid>).
Also look at the partition limits as well as any QOS items (if you are
using them).
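A minimal set of commands for those checks (the job ID and names are placeholders):

    scontrol show job <jobid>            # requested partition, node list, TRES, and the pending Reason
    scontrol show partition <partition>  # MaxNodes, MaxTime, AllowQos, partition state
    sacctmgr show qos format=Name,MaxWall,MaxTRES,GrpTRES   # only relevant if QOS limits are in use

The Reason= field in the scontrol job output is usually the quickest pointer to whichever limit is holding the job back.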
Brian Andrus
On 4/1/2021 10:00 AM, Sajesh Singh wrote:
Some additional information after enabling debug3 on the slurmctld daemon:
Logs show that there are enough usable nodes for the job:
[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11
[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing no
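For reference, the controller's log level can also be changed at runtime instead of editing SlurmctldDebug and restarting:

    scontrol setdebug debug3   # raise slurmctld verbosity
    scontrol setdebug info     # put it back when done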
Run 'sinfo -R' to see if any of your nodes are out of the mix.
If so, resume them and see if things work.
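A sketch of that check (node-11 is used here only as a placeholder name):

    sinfo -R                                        # lists down/drained nodes and the reason set on them
    scontrol update NodeName=node-11 State=resume   # clear a down/drained state so the node is schedulable again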
Brian Andrus
On 4/1/2021 1:53 AM, Steve Brasier wrote:
Hi all, anyone have suggestions for debugging cloud nodes not
resuming? I've had this working before but I'm now using "configless"