Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is *converged*, i.e., each compute node contributes a disk to the file system. The file system can tolerate at most N nodes being down simultaneously.
Therefore my problem is this: "Reboot all nodes, permitting at most N nodes to be rebooting simultaneously." I have thought about the following options:

- A mass `scontrol reboot` - There doesn't seem to be any way to control how many nodes are being rebooted at once.
- A job array - Job arrays can easily be throttled to at most N simultaneously running tasks (e.g. `--array=1-M%N`). However, I would need each array task to execute on a specific node, which does not appear to be possible.
- Individual Slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node (rough sketch after this list). But I'm not sure how to limit this so that at most N jobs are running simultaneously. Perhaps a special partition is needed for this?

Open to hearing any other ideas. Thanks!

Phil
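For what it's worth, the rough, untested sketch I had in mind for the third option looks something like this. The `reboot` QOS, the `GrpJobs` limit, and running `sudo /sbin/reboot` inside the job are all guesses on my part about how the throttling and the reboot itself could be done, not anything I've verified:

```bash
#!/bin/bash
# Rough sketch of the "individual jobs" option - untested.
# Assumes a dedicated QOS (the name "reboot" is made up) whose GrpJobs
# limit caps how many of these jobs run at once, e.g. set up once by an
# admin with:
#   sacctmgr add qos reboot
#   sacctmgr modify qos reboot set GrpJobs=4   # N = 4 in this example

# Submit one exclusive job per compute node, pinned to that node.
for node in $(sinfo -N -h -o "%N" | sort -u); do
    sbatch --qos=reboot \
           --nodelist="$node" \
           --exclusive \
           --job-name="reboot-$node" \
           --wrap="sudo /sbin/reboot"   # assumes the job is allowed to reboot its node
done
```

Even if the QOS throttling works, a job that takes down its own node presumably needs extra handling (drain/requeue behaviour, getting the node back to idle), which is part of why I'm asking.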