Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Gerhard Strangar
Phil Chiu wrote:
> - Individual slurm jobs which reboot nodes - With a for loop, I could
> submit a reboot job for each node. But I'm not sure how to limit this so at
> most N jobs are running simultaneously.

With a fake license called reboot?

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Christopher Samuel
On 8/3/22 11:47 am, Benjamin Arntzen wrote:
> At risk of being a heretic, why not something like Ansible to handle this?

Nothing heretical about that, but for us the reason is that `scontrol reboot ASAP` is integrated nicely into the scheduling of jobs, we have health checks and node epilogs t

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Christopher Samuel
On 8/3/22 8:37 am, Phil Chiu wrote:
> Therefore my problem is this - "Reboot all nodes, permitting N nodes to be rebooting simultaneously."

I think currently the only way to do that would be to have a script that does:
* issue the `scontrol reboot ASAP nextstate=resume [...]` for 3 nodes
* wa

Re: [slurm-users] Frontend node mode issues identified in v22.05.2

2022-08-03 Thread Jordi Blasco
Hi, I have been maintaining a Slurm simulator for ages. I have everything automated in order to try new features and keep my configuration up to date, version after version. Unfortunately, from version 21, the

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Brian Andrus
So an example of using slurm to reboot all nodes 3 at a time:

    sinfo -h -o %n | xargs -I{} --max-procs=3 scontrol reboot {}

If you want to get fancy, make a script that does the reboot and waits for the node to be back up before exiting and use that instead of the 'scontrol reboot' part.

Brian
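A minimal sketch of the "reboot and wait until the node is back" script suggested above, under stated assumptions. As far as I know, `scontrol reboot` returns as soon as the request is accepted, so a wait step is what actually caps the number of nodes down at once. The helper names `node_is_back` and `rolling_reboot` and the `POLL_SECONDS` variable are hypothetical. The state test is also an assumption: it treats a `@`, `^`, or `*` suffix in `sinfo` output as pending reboot, rebooting, or not responding, which should be checked against the sinfo man page for your Slurm version.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and POLL_SECONDS are illustrative, not from the thread.
POLL_SECONDS=${POLL_SECONDS:-30}

# Succeeds once sinfo reports a state for the node that carries no reboot,
# drain, down, or not-responding indication (assumed state strings).
# An unknown node yields an empty state and is never considered back.
node_is_back() {
    local state
    state=$(sinfo -h -n "$1" -o '%T' | head -n 1)
    case "$state" in
        ''|*@*|*^*|*\**|down*|drain*|reboot*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage: rolling_reboot N node1 node2 ...
# Reboots the nodes in batches of at most N and waits for the whole
# batch to return before starting the next one.
rolling_reboot() {
    local max=$1; shift
    local batch node
    while [ "$#" -gt 0 ]; do
        batch=("${@:1:max}")        # next (up to) N node names
        shift "${#batch[@]}"
        for node in "${batch[@]}"; do
            scontrol reboot ASAP nextstate=resume "$node"
        done
        for node in "${batch[@]}"; do
            until node_is_back "$node"; do
                sleep "$POLL_SECONDS"
            done
        done
    done
}
```

With the functions loaded, something like `rolling_reboot 3 $(sinfo -h -o %n)` would walk all nodes three at a time. Waiting per batch is simpler than a sliding window but slower, since one slow node holds up the next batch.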

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Benjamin Arntzen
At risk of being a heretic, why not something like Ansible to handle this? Slurm "should" be able to do it but feels like a bit of a weird fit for the job.

From: slurm-users on behalf of Phil Chiu
Sent: Wednesday, 3 August 2022, 5:51 pm
To: slurm-us...@schedmd.com
Subject: [slurm-users] Rolling rebo

[slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Phil Chiu
Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is *converged*, i.e., each compute node contributes a disk to the file system. The file system can tolerate having N nodes down simultaneously. Therefore my problem is this - "Reboot all nodes,

Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-08-03 Thread byron
Thanks for everyone's help. All I needed to do was compile a new version of pam_slurm.so. I'm aware there's a newer pam_slurm_adopt but everything was already set up for pam_slurm.so so I just went with that.

Regards
Lloyd

On Wed, Jul 27, 2022 at 9:45 PM Bernd Melchers wrote:
> >This happen