On 29-09-2023 17:33, Ryan Novosielski wrote:
I’ll just say, we haven’t done an online/jobs running upgrade recently (in part because we know our database upgrade will take a long time, and we have some processes that rely on -M), but we have done it and it does work fine. So the paranoia isn’t necessary unless you know that, like us, the DB upgrade time is not tenable (Ole’s wiki has some great suggestions for how to test that, but they aren’t especially Slurm specific, it’s just a dry-run).
Slurm upgrades are clearly documented by SchedMD, and there's no reason to worry if you follow the official procedures. At least, it has always worked for us :-)
Just my 2 cents: The detailed list of upgrade steps/commands (first dbd, then ctld, then slurmds, finally login nodes) is documented on my Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
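To illustrate the ordering above, here is a minimal Python sketch. It is not taken from the Wiki page: the names UPGRADE_ORDER, is_consistent and next_to_upgrade, as well as the version tuples, are made up for illustration. It only encodes the rule "dbd first, then ctld, then slurmds, finally login nodes" and does not run any Slurm commands:

```python
# Hypothetical sketch of the upgrade order described above.
# Component names are labels only; nothing here talks to a real cluster.
UPGRADE_ORDER = ["slurmdbd", "slurmctld", "slurmd", "login"]


def is_consistent(versions):
    """Return True if no component is newer than one that precedes it
    in UPGRADE_ORDER (i.e. the dbd is never behind the ctld, and so on).

    versions maps a component name to a (major, minor) tuple.
    """
    ordered = [versions[name] for name in UPGRADE_ORDER]
    return all(earlier >= later for earlier, later in zip(ordered, ordered[1:]))


def next_to_upgrade(versions, target):
    """Return the first component in UPGRADE_ORDER still below target,
    or None when everything has reached the target version."""
    for name in UPGRADE_ORDER:
        if versions[name] < target:
            return name
    return None


if __name__ == "__main__":
    # Illustrative version numbers, assumed for this example only.
    state = {
        "slurmdbd": (23, 2),
        "slurmctld": (22, 5),
        "slurmd": (22, 5),
        "login": (22, 5),
    }
    print(is_consistent(state))
    print(next_to_upgrade(state, (23, 2)))
```

A check like this could sit in front of whatever tooling you use to roll out the packages, so that nobody upgrades the slurmds before the ctld by accident. The real commands are on the Wiki page.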
The Slurm dbd upgrade instructions in https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#make-a-dry-run-database-upgrade are totally Slurm-specific, since that's the only database upgrade I've ever done :-) I highly recommend doing the dry-run database upgrade on a test node before doing the real dbd upgrade!
/Ole