We don't follow the recommended procedure here but rather build RPMs and upgrade using those. We haven't had any issues. Here is our procedure:

1. Build RPMs from source using a version of the slurm.spec file that we maintain. It's the version SchedMD provides, modified with some specifics for our environment and to disable automatic restarts on upgrade, which can cause problems, especially when upgrading the Slurm database.

2. We test the upgrade on our test cluster using the following sequence (a rough command sketch follows step 3).

a. Pause all jobs and stop all scheduling.

b. Stop slurmctld and slurmdbd.

c. Backup spool and the database.

d. Upgrade the Slurm RPMs (make sure the upgrade will not automatically restart slurmdbd or slurmctld, or you may end up in a world of hurt).

e. Run slurmdbd -Dvvvvv to do the database upgrade. Depending on the upgrade this can take a while because of database schema changes.

f. Once the conversion finishes, restart slurmdbd using the systemd service.

g. Upgrade the Slurm RPMs across the cluster using Salt.

h. Global restart of slurmd and slurmctld.

3. If that all looks good we rinse and repeat on our production cluster.
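Roughly, steps a-f look like the following on the dbd/ctld host, with g-h pushed out via Salt. The paths, package names, database name, and Salt targets here are assumptions for illustration rather than our exact commands; adjust for your site, and substitute yum for dnf on older releases:

    # a. stop new work from starting; holding pending jobs (and optionally
    #    suspending running ones) is one way to do it
    squeue -h -t PD -o %i | xargs -r -n1 scontrol hold
    squeue -h -t R  -o %i | xargs -r -n1 scontrol suspend

    # b. stop the controller and the database daemon
    systemctl stop slurmctld slurmdbd

    # c. back up the state spool and the accounting database
    tar czf /root/slurmctld-state-$(date +%F).tar.gz /var/spool/slurmctld
    mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db-$(date +%F).sql

    # d. upgrade the RPMs on this host (assumes they sit in the current
    #    directory and that the scriptlets no longer restart anything)
    dnf upgrade ./slurm-*.rpm

    # e. run the schema conversion in the foreground with verbose logging;
    #    Ctrl-C once it finishes converting and settles into normal operation
    slurmdbd -Dvvvvv

    # f. start slurmdbd normally
    systemctl start slurmdbd

    # g./h. upgrade the compute nodes, restart slurmd, then bring back slurmctld
    salt 'node*' cmd.run "dnf -y upgrade 'slurm*'"
    salt 'node*' service.restart slurmd
    systemctl start slurmctld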

The RPMs have worked fine for us.  The main hitch is the automatic restart on upgrade, which I do not recommend.  You should neuter that portion of the provided spec file, especially for the slurmdbd upgrades; a quick way to check your packages is sketched below.
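For example, after building you can dump the packages' install scriptlets and confirm nothing restarts the daemons behind your back. The rpmbuild invocations and sub-package names below are illustrative; they depend on the tarball version and on how your spec splits things up:

    # build from your maintained spec (sources staged under ~/rpmbuild/SOURCES)
    rpmbuild -bb ~/rpmbuild/SPECS/slurm.spec

    # or build straight from a release tarball with its bundled spec
    rpmbuild -ta slurm-20.02.x.tar.bz2

    # check the %post/%postun scriptlets for anything that restarts services
    rpm -qp --scripts ~/rpmbuild/RPMS/x86_64/slurm-slurmdbd-*.rpm | grep -iE 'restart|systemctl'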

We generally prefer the RPM method as it is the native way to manage packages on the OS and works well with Puppet.

-Paul Edmon-

On 11/2/2020 10:13 AM, Jason Simms wrote:
Hello all,

I am going to reveal the degree of my inexperience here, but am I perhaps the only one who thinks that Slurm's upgrade procedure is too complex? Or, at least maybe not explained in enough detail?

I'm running a CentOS 8 cluster, and to me, I should be able simply to update the Slurm package and any of its dependencies, and that's it. When I looked at the notes from the recent Slurm Users' Group meeting, however, I saw that while that mode is technically supported, it is not recommended, and instead one should always rebuild from source. Really?

So, OK, regardless of whether that's the case, the upgrade notes linked in the prior post don't, in my opinion, go into enough detail. They tell you broadly what to do, but not necessarily how to do it. I'd welcome example commands for each step (understanding that changes might be needed to account for local configurations). That section has no examples addressing recompiling from source, for instance.

Now, I suspect a chorus of "if you don't understand it well enough, you shouldn't be managing it." OK. Perhaps that's fair enough. But I came into this role via a non-traditional route and am constantly trying to improve my admin skills, and I may not have the complete mastery of all aspects quite yet. But I would also say that documentation should be clear and complete, and not written solely for experts. To be honest, I've had to go to lots of documentation external to SchedMD to see good examples of actually working with Slurm, or even ask the helpful people on this group. And I firmly believe that if there is a packaged version of your software - as there is for Slurm - that should be the default, fully-working way to upgrade.

Warmest regards,
Jason

On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    In general I would follow this:

    https://slurm.schedmd.com/quickstart_admin.html#upgrade

    Namely:

    Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x)
    involves changes to the state files with new data structures, new
    options, etc. Slurm permits upgrades to a new major release from
    the past two major releases, which happen every nine months (e.g.
    18.08.x or 19.05.x to 20.02.x) without loss of jobs or other state
    information. State information from older versions will not be
    recognized and will be discarded, resulting in loss of all running
    and pending jobs. State files are *not* recognized when
    downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
    resulting in loss of all running and pending jobs. For this
    reason, creating backup copies of state files (as described below)
    can be of value. Therefore when upgrading Slurm (more precisely,
    the slurmctld daemon), saving the /StateSaveLocation/ (as defined
    in /slurm.conf/) directory contents with all state information is
    recommended. If you need to downgrade, restoring that directory's
    contents will let you recover the jobs. Jobs submitted under the
    new version will not be in those state files, but it can let you
    recover most jobs. An exception to this is that jobs may be lost
    when installing new pre-release versions (e.g. 20.02.0-pre1 to
    20.02.0-pre2). Developers will try to note these cases in the NEWS
    file. Contents of major releases are also described in the
    RELEASE_NOTES file.

    So I wouldn't go directly to 20.x; instead I would go from 17.x to
    19.x and then to 20.x.
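
    Before each hop it's also worth saving the controller state
    directory the documentation mentions. A minimal sketch, assuming a
    typical layout (check your own slurm.conf for the real path):

        scontrol show config | grep -i StateSaveLocation
        systemctl stop slurmctld
        tar czf /root/statesave-$(date +%F).tar.gz /var/spool/slurmctld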

    -Paul Edmon-

    On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
    We're doing something similar. We're continuing to run production
    on 17.x and have set up a new server/cluster running 20.x for
    testing and MPI app rebuilds.

    Our plan had been to add recently purchased nodes to the new
    cluster, and at some point turn off submission on the old cluster
    and switch everyone to submission on the new cluster (new
    login/submission hosts). That way previously submitted MPI apps
    would continue to run properly. As the old cluster partitions
    started to clear out we'd mark ranges of nodes to drain and move
    them to the new cluster.

    We've since decided to wait until January, when we've scheduled
    some downtime. The process will remain the same wrt moving nodes
    from the old cluster to the new, _except_ that everything will be
    drained, so we can move big blocks of nodes and avoid slurm.conf
    Partition line ugliness.

    We're starting with a fresh database to get rid of the bug-induced
    corruption that prevents GPUs from being fenced with cgroups.

    regards,
    s

    On Mon, Nov 2, 2020 at 8:28 AM navin srivastava
    <navin.alt...@gmail.com> wrote:

        Dear All,

        Currently we are running Slurm version 17.11.x and want to
        move to 20.x.

        We are building the new server with Slurm 20.2 and are
        planning to upgrade the client nodes from 17.x to 20.x.

        We wanted to check whether we can upgrade the clients from
        17.x to 20.x directly, or whether we need to go through 18.x
        and 19.x first.

        Regards
        Navin.





--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632
