Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Chris Samuel
On 3/8/22 10:20 pm, Gerhard Strangar wrote:
> With a fake license called reboot?

It's a neat idea, but I think there is a catch:
* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script e
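For illustration, the fake-license idea could be sketched as a small generator of sbatch command lines, one per node, each requesting one license. This is only a sketch under assumptions: it presumes slurm.conf defines a local license pool such as `Licenses=reboot:3`, and `reboot_node.sh` and the node names are hypothetical and do not come from the thread. The catch described above still applies: a license is released as soon as its job ends, so this caps the number of concurrent reboot jobs, not necessarily the number of nodes that are down.

```python
import shlex


def reboot_job_commands(nodes, license_name="reboot", script="reboot_node.sh"):
    """Return one sbatch command line per node, each asking for one license.

    Assumes a license pool named `license_name` exists in slurm.conf;
    `script` is a hypothetical per-node reboot script.
    """
    commands = []
    for node in nodes:
        args = [
            "sbatch",
            f"--nodelist={node}",             # pin the job to this node
            "--exclusive",                    # do not share the node with other jobs
            f"--licenses={license_name}:1",   # one license per reboot job
            f"--job-name=reboot-{node}",
            script,
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for line in reboot_job_commands(["node01", "node02", "node03", "node04"]):
        print(line)
```

With a pool of 3 licenses, Slurm would hold the fourth and later jobs pending until a license is returned.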

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread David Simpson
Another way might be to implement Slurm power off/on (if not already) and induce it as required.

David Simpson - Senior Systems Engineer
ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Brian Andrus
This is actually brilliant!

Brian Andrus

On 8/3/2022 10:20 PM, Gerhard Strangar wrote:
> Phil Chiu wrote:
>> - Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running simulta

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Tina Friedrich
...job dependencies are also an option, thinking about this. You could carve it up into X 'sets' of N nodes, with node-specific reboot jobs that depend on the previous job in the same 'N' to finish.

Tina

On 04/08/2022 11:23, Tina Friedrich wrote:
> I'm thinking something like that currently - se
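One possible reading of the job-dependency idea is sketched below: the node list is split into N chains, and each reboot job waits for the previous job in its own chain, so at most N reboot jobs are eligible to run at once. This is a sketch under assumptions, not what anyone in the thread actually ran: `reboot_node.sh` and the node names are hypothetical, the round-robin split is just one way to carve up the nodes, and the generated shell script assumes a single cluster, where `sbatch --parsable` prints only the job ID. As with the license approach, a dependency is satisfied when the previous job ends, which may be before that node is back in service.

```python
def build_chains(nodes, n):
    """Split nodes round-robin into at most n chains (empty chains are dropped)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    chains = [nodes[i::n] for i in range(n)]
    return [chain for chain in chains if chain]


def render_script(chains, script="reboot_node.sh"):
    """Render a bash script that submits each chain as a dependency sequence.

    `script` is a hypothetical per-node reboot script.
    """
    lines = ["#!/bin/bash", "set -euo pipefail"]
    for index, chain in enumerate(chains):
        lines.append(f"# chain {index}")
        for position, node in enumerate(chain):
            # The first job in a chain has no dependency; later jobs wait
            # for the previously submitted job in the same chain to end.
            dependency = "" if position == 0 else " --dependency=afterany:$prev"
            lines.append(
                f"prev=$(sbatch --parsable --nodelist={node}{dependency} {script})"
            )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names; N = 3 chains.
    nodes = [f"node{i:02d}" for i in range(1, 8)]
    print(render_script(build_chains(nodes, 3)), end="")
```

Printing the script instead of submitting directly keeps the sketch side-effect free; the output can be reviewed before being run on a submit host.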

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Tina Friedrich
I'm thinking something like that currently - setting up some kind of TRES resource that limits how many are rebooted at any one time. I usually do this sort of thing more or less manually; as in, I generated a list of sbatch commands with the reboot job (one job per node, specifying node name)