Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Brian Andrus
How are you taking them offline? I would expect a SuspendProgram script that is running the command that shuts them down. Also, one of your SlurmctldParameters should be "idle_on_node_suspend".

Brian Andrus

On 4/1/2021 12:25 PM, Sajesh Singh wrote:
> Brian, Targeting the correct partition an…
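Not from the thread itself, but for context, a minimal slurm.conf power-saving sketch along the lines Brian describes; the script paths and timing values here are assumptions, not something the poster confirmed:

    # Hypothetical slurm.conf power-saving stanza (paths/timings are examples only)
    SuspendProgram=/opt/slurm/bin/node-suspend.sh   # script that actually powers nodes off
    ResumeProgram=/opt/slurm/bin/node-resume.sh     # script that boots them back up
    SuspendTime=600                                 # seconds a node sits idle before suspend
    SuspendTimeout=120
    ResumeTimeout=300
    SlurmctldParameters=idle_on_node_suspend        # mark nodes idle on suspend, per the suggestion above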

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Sajesh Singh
Brian,

The job is targeting the correct partition, and no QOS limits are imposed that would cause this issue. The only way I found to remedy it is to completely remove the cloud nodes from Slurm, restart slurmctld, re-add the nodes to Slurm, and restart slurmctld again. I believe the issue is caused when the nodes in the cl…
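For anyone hitting the same thing, the workaround described above would roughly be the following sequence; the config path is assumed to be the default and the exact edits are not spelled out in the thread:

    # 1. Remove the cloud NodeName (and related PartitionName) lines from /etc/slurm/slurm.conf, then:
    systemctl restart slurmctld
    # 2. Restore the cloud node definitions in slurm.conf, then restart again:
    systemctl restart slurmctld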

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Brian Andrus
For this one, you want to look closely at the job. Is it targeting a specific partition/nodelist? See what resources it is looking for (scontrol show job <jobid>). Also look at the partition limits, as well as any QOS items (if you are using them).

Brian Andrus

On 4/1/2021 10:00 AM, Sajesh Singh wrote: …
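Spelled out as commands (the job ID, partition name, and QOS fields are placeholders, not values from the thread):

    scontrol show job <jobid>            # requested node count, partition, QOS, TRES
    scontrol show partition <partition>  # MaxNodes and other per-partition limits
    sacctmgr show qos format=Name,MaxTRES,MaxTRESPU,MaxJobsPU   # QOS limits, if accounting is in use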

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Sajesh Singh
Some additional information after enabling debug3 on the slurmctld daemon. The logs show that there are enough usable nodes for the job:

[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11
[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing no…
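As an aside for anyone reproducing this, the debug3 level can be toggled on a live slurmctld without a restart:

    scontrol setdebug debug3   # temporary; revert afterwards with: scontrol setdebug info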

Re: [slurm-users] Slurm cloud scheduling/power saving

2021-04-01 Thread Brian Andrus
Run 'sinfo -R' to see if any of your nodes are out of the mix. If so, resume them and see if things work.

Brian Andrus

On 4/1/2021 1:53 AM, Steve Brasier wrote:
> Hi all, anyone have suggestions for debugging cloud nodes not resuming? I've had this working before but I'm now using "configless"…
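Concretely, that check and the manual resume look like this (the node name is hypothetical):

    sinfo -R                                          # list down/drained nodes with the recorded reason
    scontrol update NodeName=cloud-01 State=RESUME    # clear the down/drain state on one node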