We use NHC to close nodes automatically. For reopening we have a
series of scripts, but they are all ad hoc and not formalized. The
same goes for closing off subsets of nodes: we just have a bunch of
bash scripts that we have rolled ourselves.
-Paul Edmon-
On 08/14/2018 05:16 AM, Pablo Llopis wrote:
Dear SLURM users,
I was wondering what kind of tools the community is using for
orchestrating SLURM operations.
For instance, say you want to execute an operation in the cluster
which requires draining the nodes first. What kind of tools are you
using to automate the state machine that would go through the
draining, applying the operation, then finally undraining the nodes?
(maybe even more convoluted procedures)
While it is possible to do these operations in a semi-manual fashion
by using a combination of automated tasks (scontrol and some
ansible/mco/bolt/whatever), this will usually result in manually
transitioning between drain -> apply operation -> undrain. The
disadvantage of this is the overhead of keeping track of the state of
draining nodes (some of our jobs can run for many weeks). In addition,
if a set of nodes are drained at midnight or during the weekend, no
jobs will be able to run until an operator triggers the next step,
which means wasting precious computing resources with idle hours :)
This is where an orchestration tool would come in handy.
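
To make the drain -> apply operation -> undrain cycle concrete, here is a
minimal sketch of such a state machine in Python. It is only an
illustration under stated assumptions, not an existing tool: the names
next_action, node_state and step are hypothetical, the node state is read
with sinfo's %T format, and the operation is whatever callable the operator
supplies. The in-memory "done" set would need to be persisted in a real
deployment.

```python
import subprocess


def normalize(state: str) -> str:
    """Lower-case a sinfo state and strip flag suffixes such as '*'."""
    return state.strip().lower().rstrip("*~#$@+!%^-")


def next_action(state: str, operation_done: bool) -> str:
    """Pure decision function: which step should run next for one node."""
    state = normalize(state)
    if operation_done:
        # Operation applied: put the node back in service if still drained.
        return "resume" if state in ("drained", "draining") else "done"
    if state == "drained":
        return "apply"   # no jobs left, safe to run the operation
    if state == "draining":
        return "wait"    # jobs still running, check again later
    return "drain"       # idle, mixed, allocated, ...: start draining


def node_state(node: str) -> str:
    """Ask sinfo for the extended state of a single node."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-n", node, "-o", "%T"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not out:
        raise RuntimeError(f"sinfo returned no state for {node}")
    return out[0]  # a node in several partitions is listed once per partition


def step(node: str, operation, done: set) -> str:
    """Advance one node by one transition; meant to be called periodically."""
    action = next_action(node_state(node), node in done)
    if action == "drain":
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
             "Reason=orchestrated maintenance"],  # reason text is an assumption
            check=True,
        )
    elif action == "apply":
        operation(node)  # hypothetical callable supplied by the operator
        done.add(node)
    elif action == "resume":
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
            check=True,
        )
    return action
```

Calling step() for every node from a timer or cron job every few minutes
would let a node that finishes draining at midnight be operated on and
resumed without waiting for an operator.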
For doing reboots, scontrol reboot already does almost all of this,
but there may be other, more complex operations to be done in a
similar fashion.
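
For reference, the reboot case could be wrapped as in the sketch below. It
assumes a Slurm version whose scontrol reboot accepts the ASAP, nextstate=
and reason= arguments and a RebootProgram configured in slurm.conf; the
helper names reboot_command and reboot are hypothetical.

```python
import subprocess


def reboot_command(nodelist: str, reason: str, asap: bool = True) -> list:
    """Build the scontrol argument list for a drain-reboot-resume cycle."""
    cmd = ["scontrol", "reboot"]
    if asap:
        cmd.append("ASAP")  # drain the nodes so they reboot once they are idle
    # nextstate=RESUME asks Slurm to return the nodes to service afterwards.
    cmd += ["nextstate=RESUME", f"reason={reason}", nodelist]
    return cmd


def reboot(nodelist: str, reason: str) -> None:
    """Submit the reboot request; Slurm handles the drain and the resume."""
    subprocess.run(reboot_command(nodelist, reason), check=True)
```

Anything more complex than a reboot would still need its own
orchestration around the drain and resume steps.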
Integration with a possible built-in healthcheck is also something to
consider, as the orchestration logic would need to take care of
disabling the healthcheck functionality that automatically
restores/resumes drained nodes to avoid conflicts.
I would like to learn how the community deals with these kinds of
operations, whether you are using Open Source tools, or you developed
your own orchestration framework. Maybe you developed your own
SLURM-specific tools to deal with this?
Thanks!
Pablo