At the risk of being a heretic, why not something like Ansible to handle this? Slurm "should" be able to do it, but it feels like a bit of a weird fit for the job.
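For what it's worth, the thing that makes Ansible a natural fit is the play-level serial keyword: it processes hosts in batches of at most that size, so at most N nodes are down at any one time. A minimal, untested sketch, assuming an inventory group called "compute" whose hostnames match the Slurm node names, that scontrol works on the machine running the playbook, and with 4 standing in for N:

  # rolling_reboot.yml - untested sketch
  - hosts: compute
    serial: 4                # process at most 4 hosts per batch
    become: true
    tasks:
      - name: Drain the node so Slurm stops scheduling new jobs onto it
        ansible.builtin.command: scontrol update nodename={{ inventory_hostname }} state=drain reason=rolling-reboot
        delegate_to: localhost    # assumes scontrol is available where ansible runs
        become: false

      - name: Reboot and wait for the host to come back up
        ansible.builtin.reboot:
          reboot_timeout: 1800

      - name: Return the node to service
        ansible.builtin.command: scontrol update nodename={{ inventory_hostname }} state=resume
        delegate_to: localhost
        become: false

Run it with something like "ansible-playbook -i inventory rolling_reboot.yml". Note that draining only stops new jobs from landing on the node; if the nodes are busy you'd still need to wait for (or requeue) running jobs before the reboot step.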
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Phil Chiu <whophilc...@gmail.com>
Sent: Wednesday, 3 August 2022, 5:51 pm
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] Rolling reboot with at most N machines down simultaneously?
Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is converged, i.e., each compute node contributes a disk to the file system. The file system can tolerate having N nodes down simultaneously.
Therefore my problem is this: "Reboot all nodes, permitting at most N nodes to be rebooting simultaneously."
I have thought about the following options:
- A mass scontrol reboot - There doesn't appear to be a way to control how many nodes are rebooted at once.
- A job array - Job arrays can be easily configured to allow at most N jobs to be running simultaneously. However, I would need each array task to execute on a specific node, which does not appear to be possible.
- Individual Slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so that at most N jobs run simultaneously. Perhaps a special partition is needed for this? (Rough sketch of the idea below.)
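For the record, here is roughly what I have in mind for that last option, assuming a QOS with a GrpJobs limit could stand in for the "special partition" (the QOS name, the limit of 4, and the reboot command are all placeholders, not something I've tested):

  # One-time admin setup: a QOS that allows at most 4 running jobs in total,
  # and which the submitting account is allowed to use.
  sacctmgr add qos reboot
  sacctmgr modify qos reboot set GrpJobs=4

  # Submit one exclusive job per node; Slurm should keep at most 4 running at once.
  for node in $(sinfo -N -h -o "%N" | sort -u); do
      sbatch --qos=reboot -w "$node" --exclusive --no-requeue \
             --job-name="reboot-$node" --wrap "sudo systemctl reboot"
  done

One thing I'm unsure about: the job presumably dies as soon as the node goes down, so the GrpJobs limit would only bound how many reboots are triggered at a time, not how long each node stays down.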
Open to hearing any other ideas.
Thanks!
Phil