Re: [slurm-users] How do you orchestrate SLURM operations, what tools do you use?

Paul Edmon Wed, 15 Aug 2018 07:04:23 -0700

So we use NHC for our automatic node closer. For reopening we have a series of scripts that we use, but they are all ad hoc and not formalized. Same with closing off subsets of nodes: we just have a bunch of bash scripts that we have rolled to do that.


-Paul Edmon-



On 08/14/2018 05:16 AM, Pablo Llopis wrote:

Dear SLURM users,
I was wondering what kind of tools the community is using for orchestrating SLURM operations.
For instance, say you want to execute an operation in the cluster which requires draining the nodes first. What kind of tools are you using to automate the state machine that would go through the draining, applying the operation, then finally undraining the nodes? (maybe even more convoluted procedures)
While it is possible to do these operations in a semi-manual fashion by using a combination of automated tasks (scontrol and some ansible/mco/bolt/whatever), this will usually result in manually transitioning between drain -> apply operation -> undrain. The disadvantage of this is the overhead of keeping track of the state of draining nodes (some of our jobs can run for many weeks). In addition, if a set of nodes is drained at midnight or during the weekend, no jobs will be able to run until an operator triggers the next step, which means wasting precious computing resources with idle hours :)
This is where an orchestration tool would come in handy.
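To make the drain -> apply operation -> undrain cycle concrete, here is a minimal Python sketch of such a state machine. It is only an illustration under stated assumptions, not an existing tool: the names Step, run_command, node_states and advance are hypothetical, the operation is any callable you supply, and it assumes that sinfo reports a node as "drained" once its last job has finished.

```python
import subprocess
from enum import Enum


class Step(Enum):
    DRAIN = "drain"      # ask Slurm to stop scheduling new jobs on the nodes
    WAIT = "wait"        # running jobs are still finishing
    APPLY = "apply"      # nodes are empty, run the maintenance operation
    RESUME = "resume"    # hand the nodes back to the scheduler
    DONE = "done"


def run_command(cmd):
    """Run a command, raise on non-zero exit, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def node_states(nodes, runner=run_command):
    """Return {node: state} as reported by sinfo, e.g. {'n1': 'draining'}."""
    out = runner(["sinfo", "-N", "-h", "-n", ",".join(nodes), "-o", "%N %T"])
    states = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            # sinfo may append a flag character such as '*' to the state
            states[fields[0]] = fields[1].rstrip("*~#$@+")
    return states


def advance(step, nodes, operation, reason="maintenance", runner=run_command):
    """Perform a single transition and return the next step."""
    nodelist = ",".join(nodes)
    if step is Step.DRAIN:
        runner(["scontrol", "update", f"NodeName={nodelist}",
                "State=DRAIN", f"Reason={reason}"])
        return Step.WAIT
    if step is Step.WAIT:
        states = node_states(nodes, runner)
        # Move on only when every node has finished its last job
        if all(states.get(node) == "drained" for node in nodes):
            return Step.APPLY
        return Step.WAIT
    if step is Step.APPLY:
        operation(nodes)  # user-supplied callable, e.g. a firmware update
        return Step.RESUME
    if step is Step.RESUME:
        runner(["scontrol", "update", f"NodeName={nodelist}", "State=RESUME"])
        return Step.DONE
    return Step.DONE
```

Since each call performs exactly one transition, the current step could be stored somewhere (a file, a database) and advance() invoked periodically, e.g. from cron. Nodes that finish draining at midnight would then be picked up on the next poll instead of waiting for an operator.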
For doing reboots, scontrol reboot almost does all of this already, but there may be other, more complex operations to be done in a similar fashion.
Integration with a possible built-in healthcheck is also something to consider, as the orchestration logic would need to take care of disabling the healthcheck functionality that automatically restores/resumes drained nodes to avoid conflicts.
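One way to avoid that conflict without switching the healthcheck off entirely could be to tag orchestrated drains with a recognizable Reason prefix and have the resume logic skip those nodes. The Python sketch below illustrates the idea. The "orch:" prefix and the function names are made up for this example, and it assumes the node reasons were obtained with something like sinfo -h -N -o "%N|%E".

```python
ORCH_PREFIX = "orch:"  # hypothetical marker set by the orchestration tool


def parse_reasons(sinfo_output):
    """Parse 'node|reason' lines into a {node: reason} dictionary."""
    reasons = {}
    for line in sinfo_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        node, reason = line.split("|", 1)
        reasons[node.strip()] = reason.strip()
    return reasons


def resumable_by_healthcheck(reasons):
    """Nodes the healthcheck may resume: those with a reason that is not ours."""
    return sorted(
        node
        for node, reason in reasons.items()
        if reason != "none" and not reason.startswith(ORCH_PREFIX)
    )
```

With this convention, a node drained with Reason="orch:kernel update" would be left alone by the healthcheck until the orchestration itself resumes it.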
I would like to learn how the community deals with these kinds of operations, whether you are using Open Source tools, or you developed your own orchestration framework. Maybe you developed your own SLURM-specific tools to deal with this?
Thanks!
Pablo
