On 03-02-2022 21:59, Ryan Novosielski wrote:
On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
On 03-02-2022 16:37, Nathan Smith wrote:
Yes, we are running slurmdbd. We could arrange enough downtime to do an
incremental upgrade of major versions as Brian Andrus suggested, at least on
the slurmctld and slurmdbd systems. The slurmds I would just upgrade
directly once the scheduler work is completed.
As Brian Andrus said, you must upgrade Slurm by at most 2 major versions at a
time, and that includes the slurmds as well! Don't do a "direct upgrade" of
slurmd by more than 2 major versions!
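
Just to sketch the order of operations for one such hop (the package names,
the <new-version> placeholder and the clush invocation below are only
illustrative; they depend on how you build and distribute Slurm and on your
node groups being configured for clush):

# systemctl stop slurmdbd
# yum install slurm-slurmdbd-<new-version>
# systemctl start slurmdbd
# systemctl stop slurmctld
# yum install slurm-slurmctld-<new-version>
# systemctl start slurmctld
# clush -ba 'systemctl stop slurmd; yum install -y slurm-slurmd-<new-version>; systemctl start slurmd'

Repeat the whole sequence for the next hop, never crossing more than 2 major
releases in any single hop.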
I recommend separate physical servers for slurmdbd and slurmctld. Then you can
upgrade slurmdbd without taking the cluster offline. It's OK for slurmdbd to
be down for many hours, since slurmctld caches the state information in the
meantime.
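
When slurmdbd itself is upgraded, the database conversion can take quite a
while on a large accounting database, which is exactly when that grace period
matters. A minimal sketch of that step (assuming a MySQL/MariaDB backend and
the default slurm_acct_db database name; adjust credentials and package
handling to your site):

# mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db-backup.sql
# systemctl stop slurmdbd
(upgrade the slurmdbd package here)
# slurmdbd -D -vvv
(watch the foreground output until the database conversion has finished, then Ctrl-C)
# systemctl start slurmdbd

Taking the mysqldump before stopping the old slurmdbd gives you a way back if
the conversion goes wrong.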
The one thing you want to watch out for here (maybe more so if you are using a
VM than a physical server, since you may have sized the RAM for how much
slurmctld appears to need, as we did) is that the caching that takes place on
the slurmctld uses memory, obviously enough when you think about it. The
result can be that if slurmdbd is down for a long time (we had someone hitting
a bug that would start running jobs right after everyone went to sleep, for
example), your slurmctld can run out of memory, crash, and then that cache is
lost. You don't normally see memory being used like that, because slurmdbd is
normally up and accepting the accounting data.
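
One way to keep an eye on that backlog while slurmdbd is down is sdiag, which
reports the size of the DBD agent queue that slurmctld is holding in memory,
together with the usual tools for the daemon's memory footprint, something
like:

# sdiag | grep -i 'dbd agent'
DBD Agent queue size: 0
# ps -C slurmctld -o rss,vsz,cmd

If that queue size keeps climbing while slurmdbd is offline, that is the
memory growth described above.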
The slurmctld caches job state information in:
# scontrol show config | grep StateSaveLocation
StateSaveLocation = /var/spool/slurmctld
The StateSaveLocation should retain job information even if slurmctld
crashes (at least the data which have been committed to disk).
The StateSaveLocation file system must not fill up, of course! There
are also some upper limits to the number of records in
StateSaveLocation, but I can't find the numbers right now.
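
Keeping an eye on the space is simple enough, for example (using whatever path
StateSaveLocation points to on your system):

# df -h /var/spool/slurmctld
# du -sh /var/spool/slurmctld/*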
/Ole