Run a secondary controller.
Run 'scontrol takeover' before making any changes, then make your
changes and restart slurmctld on the primary.
If the primary fails to start, no harm, no foul, because the secondary
is still running happily. If it starts cleanly, it takes control back
and you can then restart the secondary with the new (known good) config.
Brian Andrus
On 1/17/2023 12:36 PM, Groner, Rob wrote:
So, you have two equal-sized clusters, one for test and one for
production? Our test cluster is a small handful of machines compared
to our production.
We have a test slurm control node on a test cluster with a test
slurmdbd host and test nodes, all named specifically for test. We
don't want a situation where our "test" slurm controller node is named
the same as our "prod" slurm controller node, because the possibility
of mistake is too great. ("I THOUGHT I was on the test network....")
Here's the ultimate question I'm trying to get answered.... Does
anyone update their slurm.conf file on production outside of an
outage? If so, how do you KNOW the slurmctld won't barf on some
problem in the file you didn't see (even a mistaken character in there
would do it)? We're trying to move to a model where we don't have
downtimes as often, so I need to determine a reliable way to continue
to add features to slurm without having to wait for the next outage.
There's no way I know of to prove the slurm.conf file is good, except
by feeding it to slurmctld and crossing my fingers.
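A rough pre-flight check can at least catch the "mistaken character" class of typo before slurmctld ever sees the file. The sketch below is only that: it assumes every non-comment line is either an Include line or a run of whitespace-separated Key=Value tokens, it does not handle quoted values containing spaces, escaped '#' characters, or backslash line continuations, and the function name is made up for illustration. A clean result does not prove slurmctld will accept the file.

```python
def find_suspect_lines(conf_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that do not look like Key=Value syntax.

    Hypothetical helper: a crude syntax screen, not a Slurm config parser.
    """
    suspects = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        # Drop anything after '#' (assumes '#' never appears inside a value).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        tokens = line.split()
        # "Include <file>" is the one form allowed here without an '='.
        if tokens[0].lower() == "include" and len(tokens) == 2:
            continue
        # Every remaining token must contain '=' and have a non-empty key.
        if all("=" in tok and not tok.startswith("=") for tok in tokens):
            continue
        suspects.append((number, raw))
    return suspects


if __name__ == "__main__":
    sample = (
        "ClusterName=test\n"
        "# a comment\n"
        "Include %c-nodes.conf\n"
        "NodeName=n[1-4] CPUs=4\n"
        "SlurmctldHost ctl1\n"  # missing '=' on purpose
    )
    for number, text in find_suspect_lines(sample):
        print(f"line {number}: {text}")
```

This would sit in front of, not instead of, a real start of slurmctld against the new file (for example on the test controller).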
Rob
------------------------------------------------------------------------
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
of Fulcomer, Samuel <samuel_fulco...@brown.edu>
*Sent:* Wednesday, January 4, 2023 1:54 PM
*To:* Slurm User Community List <slurm-users@lists.schedmd.com>
*Subject:* Re: [slurm-users] Maintaining slurm config files for test
and production clusters
Just make the cluster names the same, with different NodeName and
Partition lines. The rest of slurm.conf can be the same. Having two
cluster names is only necessary if you're running production in a
multi-cluster configuration.
Our model has been to have a production cluster and a test cluster
which becomes the production cluster at yearly upgrade time (for us,
next week). The test cluster is also used for rebuilding MPI prior to
the upgrade, when the PMI changes. We force users to resubmit jobs at
upgrade time (after the maintenance reservation) to ensure that MPI
runs correctly.
On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob <rug...@psu.edu> wrote:
We currently have a test cluster and a production cluster, all on
the same network. We try things on the test cluster, and then we
gather those changes and make a change to the production cluster.
We're doing that through two different repos, but we'd like to
have a single repo to make the transition from testing configs to
publishing them more seamless. The problem is, of course, that
the test and production clusters have different cluster
names, as well as different nodes within them.
Using the include directive, I can pull all of the NodeName lines
out of slurm.conf and put them into %c-nodes.conf files, one for
production, one for test. That still leaves me with two problems:
* The ClusterName itself will still be a problem. I WANT the
  same slurm.conf file between test and production...but the
  ClusterName line will be different for them both. Can I use
  an env var in that cluster name, because on production there
  could be a different env var value than on test?
* The gres.conf file. I tried using the same "include" trick
  that works on slurm.conf, but it failed because it did not
  know what the "ClusterName" was. I think that means that
  either it doesn't work for anything other than slurm.conf, or
  that the ClusterName will have to be defined in gres.conf as well?
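One possible workaround for the first bullet, sketched here rather than recommended: keep a single slurm.conf template in the repo and fill in the ClusterName at deploy time from an environment variable, so Slurm itself never has to read an env var. The template contents, the SLURM_CLUSTER variable name, and the function name are all assumptions made up for this example. The "%c-nodes.conf" include is left untouched for Slurm to expand.

```python
import os
from string import Template

# Hypothetical shared template; only the ClusterName line differs per site.
# "%c" is not special to string.Template, so it passes through unchanged.
SLURM_CONF_TEMPLATE = Template(
    "ClusterName=${cluster_name}\n"
    "Include %c-nodes.conf\n"
)


def render_slurm_conf(cluster_name: str) -> str:
    """Return slurm.conf text for one cluster, built from the shared template."""
    if not cluster_name or any(ch.isspace() for ch in cluster_name):
        raise ValueError(f"invalid cluster name: {cluster_name!r}")
    return SLURM_CONF_TEMPLATE.substitute(cluster_name=cluster_name)


if __name__ == "__main__":
    # SLURM_CLUSTER is an invented variable name; it defaults to "test" here
    # so that a forgotten setting never produces a production config.
    print(render_slurm_conf(os.environ.get("SLURM_CLUSTER", "test")), end="")
```

The rendered file would be written out by whatever deploys the repo on each controller; the same approach could in principle generate gres.conf, though that is untested here.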
Any other suggestions of how to keep our slurm files in a single
source control repo, but still have the flexibility to have them
run elegantly on either test or production systems?
Thanks.