On 03-02-2022 21:59, Ryan Novosielski wrote:
On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:

On 03-02-2022 16:37, Nathan Smith wrote:
Yes, we are running slurmdbd. We could arrange enough downtime to do an 
incremental upgrade of major versions as Brian Andrus suggested, at least on 
the slurmctld and slurmdbd systems. The slurmds I would just upgrade directly 
once the scheduler work was completed.

As Brian Andrus said, you can only upgrade Slurm by at most 2 major versions at a 
time, and that includes the slurmds as well!  Don't do a "direct upgrade" of slurmd 
across more than 2 major versions!
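
Before planning the upgrade path, it's worth checking which versions you actually 
have installed on each host, for example (just a sketch, the package names and 
paths depend on your distribution):

# slurmctld -V
# slurmdbd -V
# slurmd -V

If you were on, say, 19.05 and wanted to get to 21.08, that is three major 
releases, so you would need an intermediate stop at 20.11 (or 20.02) first.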

I recommend separate physical servers for slurmdbd and slurmctld.  Then you can 
upgrade slurmdbd without taking the cluster offline.  It's OK for slurmdbd to 
be down for many hours, since slurmctld caches the state information in the 
meantime.
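
As a rough sketch of what the slurmdbd-only upgrade can look like (assuming a 
systemd-based install, MariaDB/MySQL, and the default slurm_acct_db database 
name; adjust to your site):

# systemctl stop slurmdbd
# mysqldump -p --single-transaction slurm_acct_db > slurm_acct_db_backup.sql
(upgrade the slurmdbd package with your distribution's package manager)
# systemctl start slurmdbd

Make the database dump before the upgrade, and watch the slurmdbd log afterwards, 
since the database conversion at a major version change can take quite a while on 
a large accounting database.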

The one thing you want to watch out for here (maybe more so if you are using a 
VM than a physical server, since you may have sized the RAM for how much slurmctld 
appears to need, as we did) is that the caching that takes place on the slurmctld 
uses memory (obvious, perhaps, when you think about it). The result can be that 
if slurmdbd is down for a long time (we had someone hitting a bug that would start 
running jobs right after everyone went to sleep, for example), your slurmctld can 
eventually run out of memory, crash, and then that cache is lost. You don't 
normally see the memory being used like that, because slurmdbd is normally up 
and accepting the accounting data.
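
If you want to keep an eye on this while slurmdbd is down, something like the 
following (just a sketch) shows how many accounting messages slurmctld has queued 
up and how much memory it is using:

# sdiag | grep -i 'dbd agent'
# ps -o rss,vsz,cmd -C slurmctld

The "DBD Agent queue size" reported by sdiag should drop back down once slurmdbd 
is reachable again.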

The slurmctld caches job state information in:
# scontrol show config | grep StateSaveLocation
StateSaveLocation       = /var/spool/slurmctld

The StateSaveLocation should retain job information even if slurmctld crashes (at least the data which have been committed to disk).

The StateSaveLocation file system must not fill up, of course! There are also some upper limits to the number of records in StateSaveLocation, but I can't find the numbers right now.
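
A simple check that can be run from cron, assuming the default location shown 
above (just a sketch):

# df -h /var/spool/slurmctld
# du -sh /var/spool/slurmctld

If that file system fills up, slurmctld can no longer save its state and you risk 
losing job information.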

/Ole
