On 26-05-2021 20:23, Will Dennis wrote:
I'm about to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm/<vernum>/, which is then symlinked to /opt/slurm/current/ for the "in-use" one…). This is a new cluster running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7), but I have researchers running jobs on it currently. While still building out the cluster, I found today that all Slurm source tarballs before 20.11.7 were withdrawn by SchedMD. So I need to upgrade at least the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7…
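
A rough sketch of that versioned-install layout (assuming a standard autotools build; the --sysconfdir and other configure options here are just illustrative, not the exact ones used):

    # build the new release into its own versioned prefix
    tar xjf slurm-20.11.7.tar.bz2
    cd slurm-20.11.7
    ./configure --prefix=/opt/slurm/20.11.7 --sysconfdir=/etc/slurm
    make -j"$(nproc)"
    sudo make install

    # point the "current" symlink at the new version
    sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current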

As I have at least one researcher running some long multi-day jobs, can I take the -dbd and -ctld nodes down, upgrade them, and then put them back online running the new (latest) release without munging the jobs on the running worker nodes?

I strongly recommend reading the SchedMD presentations on the https://slurm.schedmd.com/publications.html page, especially the "Field Notes" documents. The latest one is "Field Notes 4: From The Frontlines of Slurm Support" by Jason Booth, SchedMD.

We routinely upgrade Slurm while the nodes are in production. There is a required order of upgrading: first slurmdbd, then slurmctld, then the slurmd nodes, and finally the login nodes; see
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
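
As a rough sketch of that order (assuming systemd-managed services and the default slurm_acct_db MariaDB/MySQL database name; the actual install step depends on whether you use RPMs or a source build):

    # 1) slurmdbd host: stop, back up the accounting database, upgrade, start
    systemctl stop slurmdbd
    mysqldump slurm_acct_db > slurm_acct_db_backup.sql
    # ... install the new slurmdbd version here ...
    systemctl start slurmdbd

    # 2) slurmctld host: stop (optionally back up StateSaveLocation), upgrade, start
    systemctl stop slurmctld
    # ... install the new slurmctld version here ...
    systemctl start slurmctld

    # 3) each compute node: upgrade and restart slurmd; running job steps
    #    continue under their existing slurmstepd processes
    systemctl restart slurmd

    # 4) finally, upgrade the client commands on the login nodes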

The detailed upgrade commands for CentOS are at
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7
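
In outline, that procedure builds RPMs from the source tarball and then upgrades the relevant packages on each node type, something like the following (paths and the package glob are assumptions for illustration):

    # on a build host with the required -devel packages installed
    rpmbuild -ta slurm-20.11.7.tar.bz2

    # then, on each node, upgrade the locally installed Slurm packages, e.g.
    yum upgrade ~/rpmbuild/RPMS/x86_64/slurm*-20.11.7-*.rpm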

We haven't had any problems with running jobs across upgrades, but perhaps others can share their experiences?

/Ole
