Re: [slurm-users] Help building slurm on OS X High Sierra

2018-06-24 Thread Benjamin Redling
On 24/06/18 01:27, George Hartzell wrote:
> I'm trying to build Slurm on a Mac running OS X Sierra (via Spack)

[...]

> There are enough mentions of darwin in the src tree that it seems like
> it should work (or has worked).
> 
> Am I chasing something hopeless?

Maybe:
https://slurm.schedmd.com/platforms.html

Good luck,
Benjamin



Re: [slurm-users] Help building slurm on OS X High Sierra

2018-06-27 Thread Benjamin Redling
On 24/06/18 22:04, Pär Lindfors wrote:
> On 06/24/2018 01:55 PM, Benjamin Redling wrote:

>> https://slurm.schedmd.com/platforms.html

> Does not seem to have been updated in a while. Solaris support was
> removed recently, probably in 17.11.


True. Apart from things getting worse, do you know of any improvements
relevant for the OP (macOS High Sierra, 10.13)?
-- I just use High Sierra for audio/e-guitar and bass, and sometimes as a
terminal to proper workhorses, and I'm not keen to change that after
wrangling with the new file system from a clean install (what a mess that
installer is!)

BR
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html




Re: [slurm-users] Resource sharing between different clusters

2018-10-19 Thread Benjamin Redling
On 18/10/2018 18:16, Cao, Lei wrote:
> I am pretty new to slurm so please bear with me. I have the following 
> scenario and I wonder if slurm currently supports this in some way.
> 
> Let's say I have 3 clusters. Cluster1 and cluster2 run their own 
> slurmctld and slurmds(this is a hard requirement), but both of them need to 
> share cluster3 so they can schedule jobs on it. Is this currently supported? 
> If so, how can I achieve this?

Have you seen:
https://slurm.schedmd.com/federation.html
?
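(For reference, and only as a sketch from my side: a federation is created
through slurmdbd with sacctmgr, roughly

  sacctmgr add federation myfed clusters=cluster1,cluster2,cluster3

where "myfed" is a made-up name. Each cluster keeps its own slurmctld and
slurmds, and jobs can then be routed across the member clusters, e.g. with
sbatch -M cluster3.)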

BR
-- 
FSU Jena | JULIELab.de/Staff/Redling/
☎ +49 3641 9 44323



Re: [slurm-users] How to partition nodes into smaller units

2019-02-09 Thread Benjamin Redling
Hello,

On 05.02.19 16:46, Ansgar Esztermann-Kirchner wrote:
> [...]-- we'd like to have two "half nodes", where
> jobs will be able to use one of the two GPUs, plus (at most) half of
> the CPUs. With SGE, we've put two queues on the nodes, but this
> effectively prevents certain maintenance jobs from running.
> How would I configure these nodes in Slurm?

why don't you use an additional "maintenance" queue/partition
containing the whole nodes?

Both SGE and SLURM support that.
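A rough slurm.conf sketch of that idea (partition and node names are
invented here; the per-job "half node" limit itself still needs its own
mechanism, e.g. GRES for the GPUs plus MaxCPUsPerNode or per-job limits):

PartitionName=gpu   Nodes=gpunode[01-10] Default=YES MaxCPUsPerNode=32 State=UP
PartitionName=maint Nodes=gpunode[01-10] AllowGroups=admins MaxTime=INFINITE State=UP

Maintenance jobs submitted to "maint" can then span whole nodes while
normal jobs stay in "gpu".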

> From the docs I gathered
> that MaxTRESPerJob would be a solution, but this is coupled to
> associations, which I do not fully understand. 
> Is this the best/only way to achieve such a partitioning? 
> If so, do I need to define an association for every user, or can I
> define a default/skeleton association that new users automatically
> inherit?
> Are there other/better ways to go?

Let's agree on "other" ;)
use the OS to partition the resources on the host -- VM, systemd-nspawn,
... .

Because we have to run VMs and services in parallel to SLURM, I tested
partitioning our (small number of) hosts via Ganeti/KVM.
Side-effect: I was able to live migrate the (virtual) node running jobs
during maintenance.
Performance was very close to bare metal and while we are currently not
running our GPU jobs this way, even GPU pass-through should be possible
with negligible performance penalty:

Walters, et al. "GPU Passthrough Performance: A Comparison of KVM, Xen,
VMWare ESXi, and LXC for CUDA and OpenCL Applications"
https://ieeexplore.ieee.org/document/6973796?tp=&arnumber=6973796

The disadvantage of OS-level partitioning is the additional effort it
requires.
But honestly, I even thought about stretching this further, for two
reasons:
1. to gain a bit more flexibility over the (poor?) elastic features of
SLURM by defining purely virtual nodes of different sizes and starting
whatever selection fits on a case-by-case basis
-- then again, I wouldn't want to do that for 900 hosts without a proper
helper program.
2. to separate a (conservative) host OS from a (modern, but stable) node
OS to ease different constraints (we had back then)


Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Redling
☎ +49 3641 9 44323



Re: [slurm-users] How should I do so that jobs are allocated to the thread and not to the core ?

2019-05-02 Thread Benjamin Redling
Have you seen the Slurm FAQ?
You may want to search on that site for "Hyperthreading"

(Sorry for the TOFU. vacation, mobile)
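(Not the FAQ answer verbatim, just the gist as far as I understand the
CR_CPU documentation: with SelectTypeParameters=CR_CPU you can declare the
node with only a CPU count and *without* the
Sockets/CoresPerSocket/ThreadsPerCore topology, e.g.

  NodeName=node01 CPUs=72 RealMemory=128000

so that every hardware thread is a schedulable CPU. Once the full topology
is configured, Slurm will not allocate more than one job per core.)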

On 30 April 2019 18:07:03 CEST, Jean-mathieu CHANTREIN wrote:
>Hello, 
>
>Most jobs of my users are single-thread. I have multithreaded
>processors. The jobs seem to reserve 2 logical CPU (1 core=2 CPU (2
>threads)) whereas it only uses 1 logical CPU(1 thread). Nevertheless,
>my slurm.conf file indicates: 
>
>[...] 
>SelectType=select/cons_res 
>SelectTypeParameters=CR_CPU 
>FastSchedule=1 
>[...] 
>NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>ThreadsPerCore=2 RealMemory=128000 
>
>And here is an excerpt from the output of a job running on this type of
>node: 
>$ scontrol show job idjob 
>[...] 
>NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* 
>TRES=cpu=1,mem=200M,node=1,billing=1 
>Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* 
>MinCPUsNode=1 MinMemoryCPU=200M MinTmpDiskNode=0 
>Features=(null) DelayBoot=00:00:00 
>OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) 
>
>But with pestat
>(https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat),
>I can see my node uses 72/72 CPUs but has only 36 jobs
>running like this one. 
>
>How should I do so that jobs are allocated to the thread and not to the
>core ? 
>
>Best regards. 
>
>Jean-Mathieu 
>
>   

-- 
This message was sent from my Android device with K-9 Mail.

Re: [slurm-users] [Long] Why are tasks started on a 30 second clock?

2019-07-25 Thread Benjamin Redling




On 25.07.19 20:11, Kirill Katsnelson wrote:
> On Thu, Jul 25, 2019 at 8:16 AM Mark Hahn wrote:
>> how about a timeout from elsewhere?  for instance, when I see a 30s
>> delay, I normally at least check DNS, which can introduce such
>> quantized delays.
>
> Thanks, it's a good guess, but is very unlikely the case.

If the 30s delay is only for jobs after the first full queue, then it is
backfill in action?


bf_interval=#
The number of seconds between backfill iterations
...
Default: 30



Regards,
Benjamin



Re: [slurm-users] OverSubscribe=FORCE:1 overloads nodes?

2019-09-08 Thread Benjamin Redling
Hello Menglong,

Which selection plugin, and in case of cons_res, which consumable resources
have you configured?
Maybe review:
https://slurm.schedmd.com/cons_res_share.html

Regards,
Benjamin

On 9 September 2019 03:38:13 CEST, "hu...@sugon.com" wrote:
>Dear there,
>I have two jobs in my cluster, which has 32 cores per compute node. The
>first job uses eight nodes and 256 cores, which means it takes up all
>eight nodes. The second job uses five nodes and 32 cores, which means
>only partial cores of five nodes will be used. Slurm, however,
>allocated some of the same nodes for the two jobs, resulting in
>overload of these nodes. I wonder if my partition configuration
>OverSubscribe=FORCE:1 caused this to happen. How to prevent this from
>happening?
>Appreciatively,
>Menglong

-- 
FSU Jena | https://julielab.de/Staff/Redling/



[slurm-users] Archived docs show 19.05 news

2019-10-24 Thread Benjamin Redling
Hello everybody,

confusing:

https://slurm.schedmd.com/archive/slurm-18.08.8/news.html
"
RELEASE NOTES FOR SLURM VERSION 19.05
28 May 2019
...
"

Bug-tracking is only via commercial support?

Regards,
Benjamin



Re: [slurm-users] RHEL8 support

2019-10-28 Thread Benjamin Redling
On 28/10/2019 08.26, Bjørn-Helge Mevik wrote:
> Taras Shapovalov  writes:
> 
>> Do I understand correctly that Slurm19 is not compatible with rhel8? It is
>> not in the list https://slurm.schedmd.com/platforms.html
> 
> It says
> 
> "RedHat Enterprise Linux 7 (RHEL7), CentOS 7, Scientific Linux 7 (and newer)"
> 
> Perhaps that includes RHEL8, and CentOS 8, not only Scientific Linux 8?

AFAIK there won't be a Scientific Linux 8 (by Fermilab):
https://listserv.fnal.gov/scripts/wa.exe?A2=SCIENTIFIC-LINUX-ANNOUNCE;11d6001.1904

So it seems that, if there aren't any other maintainers taking care of a
potential SL8 and "and newer" was written intentionally, it has to mean
RHEL 8 or CentOS 8.

Regards,
Benjamin
-- 
FSU Jena | https://JULIELab.de/Staff/Redling/
☎  +49 3641 9 44323



[slurm-users] ProEpiLogInterfacePlugin -> PerilogueInterfacePlugin (E.A. Schneider @ CMU'76?)

2020-02-21 Thread Benjamin Redling
Hello everybody,

only yesterday I had time to review:
https://slurm.schedmd.com/SLUG19/Slurm_20.02_and_Beyond.pdf
"
If you have a good name for this plugin type, I haven't found a good
name - "ProEpiLogInterfacePlugin" is a bit unwieldy
"

So I searched for a hypernym of "prologue" and "epilogue" and couldn't
directly find one.
But IMO I found something very closely related.
If there isn't already a better name, I suggest
"PerilogueInterfacePlugin", because of the following possible historical
IT-roots:

https://jdebp.eu/FGA/function-perilogues.html
"
Yes, "perilogue" is a real word — sort of. It's only ever been used as a
technical term in computing, and was first used by Edward Anton
Schneider of Carnegie-Mellon University in 1976 to mean the start or
finish of an operation. Clearly this is a useful term for the
combination of a prologue and an epilogue, which are inseparable from
each other when it comes to discussions of compiled functions in
computer languages, and lack another word for their combination.

As "prologue" comes from the Greek "προ", meaning "before", and as
"epilogue" comes from the Greek "επι", meaning "after", so "perilogue"
comes from the Greek "περι", meaning "around/about". Indeed, the word
"περιλεγειν" actually exists in Classical Greek, in the writings of
Hermippus, meaning "to talk around" something.
"


Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Redling



Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Benjamin Redling
Hi Kush,

have you tried searching for (parts of) that error message? Is R Studio Pro 
aware of the following changes?
https://lists.schedmd.com/pipermail/slurm-users/2018-May/001296.html

Regards,
Benjamin

On 20 July 2020 16:50:08 CEST, "Sidhu, Khushwant" wrote:
>Hi,
>
>I'm trying to use rstudio & slurm to submit jobs but am getting the
>attached error.
>Has anyone come across it or know what I'm doing wrong ?
>
>Thanks
>
>Regards
>
>Khush
>
>Disclaimer: This email and any attachments are sent in strictest
>confidence for the sole use of the addressee and may contain legally
>privileged, confidential, and proprietary data. If you are not the
>intended recipient, please advise the sender by replying promptly to
>this email and then delete and destroy this email and any attachments
>without any further use, copying or forwarding.

-- 
This message was sent from my Android device with K-9 Mail.

Re: [slurm-users] SLURM on a large shared memory node

2020-12-03 Thread Benjamin Redling

Hello Benson,

On 24/11/2020 14.20, Benson Muite wrote:
> Am setting up SLURM on a single shared memory machine. Found the
> following blog post:
>
> http://rolk.github.io/2015/04/20/slurm-cluster


Sorry, but that is only a random, outdated blog post from 2015.
Even the Slurm 16.05 shipped with Debian 9 (stretch) has automatic
handling of cgroups -- you don't have to set them up manually.


I recommend looking up which version is packaged for your distribution
if you're not going to compile from source. Depending on your choice,
start either with the official documentation for the current version,
20.11,

https://slurm.schedmd.com/,

or with the documentation for your available packaged version in
the archive,

https://slurm.schedmd.com/archive/


> The main suggestion is to use cgroups to partition the resources. Are
> there any other suggestions of changes to implement that differ from the
> standard cluster setup?


I would start with the defaults and read, read, ..., read, while trying
to add features step by step.
I think imitating a setup you don't really understand is a really bad
idea; there will be more than enough questions even when starting with
the basics.


Start slow. Look up the defaults of your packaged version; if you're
compiling from source, use the config generators from SchedMD after
reading through the basics.
You could run multiple jobs on a single node even before cgroups. Try to
find the relevant sections in the official documentation to understand
how that works, its limitations, and why it might be a good thing to use
cgroups nowadays.
Again, having a vague idea that you want to "partition" the node won't
get you very far. IMO it's better to have at least a basic idea of how
Slurm operates.

[D.C.]
What do you want next?
(The first thing I wanted in a cluster was select/cons_res with
CR_Core_Memory instead of the default select/linear. RTFM what that all
means and why that is/isn't a good idea in your case; when or when not
to use CR_CPU_Memory; next was understanding backfilling and its
requirement for time limits -- see the sketch after this block).

Read.
Test.
Optionally ask on the list if you're having a single concrete issue.
[Da Capo al Fine]
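A minimal sketch of the two knobs mentioned above -- cores/memory as
consumable resources plus cgroup containment -- as something to read up on
rather than copy blindly (option names may differ slightly between
versions):

# slurm.conf
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes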

(In parallel and repeatedly):
1a) The official documentation
1b) Ole Holm Nielsen's docs, starting at 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation -- even if you're 
using a Debian-based distribution, read it to get an understanding of 
the different parts a Slurm installation is made of.


(For anything beyond the basics you didn't grasp from 1a & b):
2) Blog of Chris Samuel (csamuel.org)

Reading the list for a longer time and trying to understand the topics 
that might be applicable to your setup will help a lot -- and you'll 
notice who else on the list you prefer reading / who is willing to 
answer questions you have / has similar issues that get answers you can 
learn from.


Good luck,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Redling



Re: [slurm-users] Assigning two "cores" when I'm only request one.

2021-07-13 Thread Benjamin Redling

On 12/07/2021 21.16, Luis R. Torres wrote:
> I'm trying to run one task on one "core", however, when I test the
> affinity, the system gives me "two"; I'm assuming the two are threads
> since the system is a dual socket system. [...]


As far as I understand, you reason that the number of hardware threads
depends on the number of sockets ("since")? Why? E.g. quad socket = four
threads? The cardinality of the sockets is coincidentally "two" -- like a
lot of things.


This is Slurm's "notion" of the socket, core and thread relations:
https://slurm.schedmd.com/mc_support.html
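(Not from that page, just a hint to experiment with: something like

  srun --ntasks=1 --cpus-per-task=1 --hint=nomultithread ./my_task

where ./my_task is a placeholder. --hint=nomultithread and
--threads-per-core=1 influence whether the second hardware thread of each
core is used; whether a single thread can be allocated at all also depends
on how the node and SelectTypeParameters are configured.)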


Regards,
Benjamin
--
FSU Jena | ksz.uni-jena.de/mitarbeiter & JULIELab.de/Staff/Redling



Re: [slurm-users] Having errors trying to run a packed jobs script

2017-11-07 Thread Benjamin Redling
Hello Marius,

On 07.11.2017 at 10:12, Marius Cetateanu wrote:
> I have a very small cluster(if it even could be called a cluster) with only
> one node for the moment; the node is a dual Xeon with 14 cores/socket,
> hyper-threaded and 256GB of memory, running CentOS 7.3.

Bigger than a small cluster a decade ago... ;)
Nice workhorse I guess.

[...]
> The moment I schedule my script I can see that there are 50 instances of
> my process started and running but just a bit afterwards only 5 or so of
> them I can see running - so I only get full load for the first 50
> instances and not afterwards.
"a bit afterwards" is too vague to reason anything aside sched_interval
just being the default 60s:
https://slurm.schedmd.com/sched_config.html

What's the (average) runtime of the jobs?
If your jobs are not running longer than the sched_interval default you
might want to *decrease* that.
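(For reference, that interval is set through SchedulerParameters in
slurm.conf, e.g.

  SchedulerParameters=sched_interval=15

-- 15 is only an example value.)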

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Benjamin Redling

On 11/8/17 3:01 PM, Douglas Jacobsen wrote:
> Also please make sure you have the slurm-munge package installed (at
> least for the RPMs this is the name of the package, I'm unsure if that
> packaging layout was conserved for Debian)

nope, it's just "munge"

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323





Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-13 Thread Benjamin Redling

On 11/12/17 4:52 PM, Gennaro Oliva wrote:
> On Sun, Nov 12, 2017 at 10:03:18AM -0500, Will L wrote:
>> I just tried `sudo apt-get remove --purge munge`, etc., and munge itself
>
> this should have uninstalled slurm-wlm also, did you reinstall it with apt?
>
>> seems to be working fine. But I still get `slurmctld: error: Couldn't find
>> the specified plugin name for crypto/munge looking at all files`. Is there
>
> if you didn't reinstall slurm with apt you may be using the slurmctld
> executable from a failed source installation, and for some reason this
> can't find the corresponding plugin directory.
>
> I suggest to try to install the slurm-wlm package with:
>
> apt-get install slurm-wlm

I would *currently* avoid the Debian (Stretch) packages like the plague:
the last update tried to (re)start slurmctld which -- surprise, surprise 
-- fails on every node that's not the master with an exit code that 
leaves the packages unconfigured.


That raises the question of whether anyone bothered to test them on a
multi-node cluster.


I'm still hoping I messed that up and not the maintainer. Maybe I expect 
too much from an "apt upgrade" nowadays...


Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] Priority wait

2017-11-13 Thread Benjamin Redling

Hi Roy,

On 11/13/17 2:37 PM, Roe Zohar wrote:
> [...]
> I sent 3000 jobs with feature Optimus and part are running while part
> are pending. Which is ok.
> But I have sent 1000 jobs to Megatron and they are all pending,
> stating they wait because of priority. Why is that?
>
> B.t.w. if I change their priority to a higher one, they start to run on
> Megatron.


My guess: if you can provide the slurm.conf of that cluster, the
probability that anyone will sacrifice their spare time for you will
increase significantly.


Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Benjamin Redling



On 11/29/17 4:32 PM, david vilanova wrote:
> Hi,
> I have updated the slurm.conf as follows:
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> NodeName=linuxcluster CPUs=2
> PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
>
> Still get testq node in down status ??? Any idea ?
>
> Below log from db and controller:
> ==> /var/log/slurm/slurmctrl.log <==
> [2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster
> linuxcluster
> [2017-11-29T16:28:30.850] error: SelectType specified more than once,


... if it says so, it probably is.

Have you checked your slurm.conf?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] slurm conf with single machine with multi cores.

2017-12-02 Thread Benjamin Redling
On 30.11.2017 at 09:31, david vilanova wrote:
> Here below my slurm.conf file:

> NodeName=linuxcluster CPUs=12
> PartitionName=testq Nodes=linuxclusterDefault=YES MaxTime=INFINITE State=UP

Missing whitespace! Nodes=linuxclusterDefault=YES is not a valid node name,

> [2017-11-30T09:24:28.430] layouts: no layout to initialize

thus "no layout to initalize"?!


Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] Remote submission hosts and security

2017-12-05 Thread Benjamin Redling
On 05.12.2017 at 22:27, Jeff White wrote:
> I have a need to allow a server which is outside of my cluster access to
> submit jobs to the cluster.  I can do that easily enough by handing my
> Slurm RPMs, config, and munge key to the owner of that server and
> opening access in my firewall.  However, since it is a system outside of
> my control the owner of it can become root (or impersonate any user they
> wish) and gain full control of Slurm.  Obviously that's not good.
> 
> Are there any mechanisms for allowing a remote host to submit jobs but
> not have any administrative access to Slurm?

You could restrict ssh to executing sbatch (authorized_keys ... command=)
and not allow a login; to allow scp-ing of job files you could combine
that with "rssh"?
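A rough sketch of that authorized_keys idea (key, paths and hostnames are
placeholders, and the wrapper is only illustrative):

# ~/.ssh/authorized_keys on the submit host (one line, key shortened):
command="/usr/local/bin/sbatch-only.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... owner@remote-server

# /usr/local/bin/sbatch-only.sh -- refuse everything except sbatch and
# read the job script from stdin:
#!/bin/sh
case "$SSH_ORIGINAL_COMMAND" in
  sbatch*) exec /usr/bin/sbatch ;;
  *) echo "only sbatch is allowed" >&2; exit 1 ;;
esac

The remote host would then submit with something like:
ssh submithost sbatch < job.sh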

Some institutions go the extra mile and build their own (web) portals,
AFAIK to transfer only the job and the user name (or any needed data) to
a service that will then execute the Slurm job.

@M.? Reading this? Portal project finished and allowed to give details?

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] Multithreads config

2018-02-16 Thread Benjamin Redling
On 16.02.2018 at 15:28, david martin wrote:
> I have a single physical server with:
>
>   * 63 cpus (each cpu has 16 cores)
>   * 480Gb total memory
>
> NodeNAME= Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 Procs=63
> REALMEMORY=48
>
> This configuration will not work. What should it be?

A proper configuration -- one that shows a basic amount of effort went
into reading the documentation.

RTFM and use the configurator:
https://slurm.schedmd.com/configurator.html

You failed to define a node name, and apart from that, just defining a node
isn't enough -- you need at least a partition that uses that node...
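For a single machine, a minimal sketch would be along these lines (node
and partition names are invented; take the hardware line from what
"slurmd -C" prints on that host):

NodeName=bigbox Sockets=4 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=480000 State=UNKNOWN
PartitionName=main Nodes=bigbox Default=YES MaxTime=INFINITE State=UP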

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html



Re: [slurm-users] slurm jobs are pending but resources are available

2018-04-17 Thread Benjamin Redling
Hello,

On 16.04.2018 at 18:50, Michael Di Domenico wrote:
> On Mon, Apr 16, 2018 at 6:35 AM,   wrote:

> perhaps i missed something in the email, but it sounds like you have
> 56 cores, you have two running jobs that consume 52 cores, leaving you
> four free.  

No. From the original mail:
<--- %< --->
"scontrol show nodes cn_burebista" gives me the following:

NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
   CPUAlloc=32
<--- %< --->

Jobs 2356 and 2357 use 32 CPUs as long as the original poster gave the
right numbers.

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] Include some cores of the head node to a partition

2018-04-22 Thread Benjamin Redling
Hello Mahmood,

On 22.04.2018 at 04:55, Mahmood Naderan wrote:
> I think that will limit other nodes to 20 too. Isn't that?

You can declare fewer CPUs than are physically available. I do that for our
cluster; it has been working robustly for ages.

> Currently computes have 32 cores per node and I want all 32 cores. The
> head node also has 32 core but I want to include only 20 cores.

Agree with you. You don't need to restrict the whole
partition/cluster/etc. to only 20 cores when you only need that for a
single node.
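A sketch of what that could look like in slurm.conf (node names are
invented):

NodeName=head CPUs=20
NodeName=compute[01-08] CPUs=32
PartitionName=all Nodes=head,compute[01-08] Default=YES MaxTime=INFINITE State=UP

Slurm then hands out at most 20 cores on the head node while the compute
nodes keep all 32.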

> On Sun, Apr 22, 2018, 03:53 Chris Samuel wrote:
> 
> All you need to do is add "MaxCPUsPerNode=20" to that to limit the
> number of
> cores that the partition can use.
> 
> We do this for our non-GPU job partition to reserve some cores for
> the GPU job
> partition

That might be a valid solution for that cluster.
But it is a wasteful solution for Mahmood's case.

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] Controller / backup controller q's

2018-05-25 Thread Benjamin Redling
On 24.05.2018 at 17:43, Will Dennis wrote:
> 3)  What are the steps to replace a primary controller, given that a
> backup controller exists? (Hopefully this is already documented
> somewhere that I haven’t found yet)
Why not drive such a small cluster with a single primary controller in a
migratable or even HA-VM?

I've never understood what a backup controller is good for when you need
very reliable/HA shared storage and a very reliable/HA MySQL (*cough*)
installation on top of that (or live with the illusion that two
processes instead of one, depending on two SPOFs, make it a dependable
cluster)
-- apart from very big clusters where a single controller VM with all
the dependencies won't cut it because it might become a bottleneck.

But I'm eager to learn about the advantages of a backup controller (and
the necessary[?] complexity)

Regards,
BR
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html