Hello,

I apologise that this email is a bit vague, but we are keen to understand the 
role of the Slurm "StateSave" location. I can see the value of the information 
held there when, for example, we are upgrading Slurm and the database is 
temporarily down, but as I note above we would like a much better understanding 
of this directory.

We have two Slurm controller nodes (one of them is a backup controller), and 
we currently keep the "StateSave" directory on one of the global GPFS file 
stores. In other respects Slurm operates independently of the GPFS file stores, 
apart from the fact that if GPFS fails, jobs will subsequently fail. There was 
a GPFS failure while I was away from the university. Once GPFS had been 
restored, my colleagues attempted to start Slurm, but the StateSave data was 
out of date. They eventually restarted Slurm, but all the queued jobs were 
lost and the job sequence counter restarted at one.
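For reference, the relevant part of our slurm.conf looks roughly like the 
following (the hostnames and the GPFS path are placeholders rather than our 
real values):

```
# Primary controller first, backup controller second (hypothetical hostnames)
SlurmctldHost=ctl1
SlurmctldHost=ctl2
# StateSave directory on a global GPFS file store (hypothetical path)
StateSaveLocation=/gpfs/slurm/statesave
```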

Am I correct in thinking that the information in the StateSave location relates 
to the state of (a) jobs currently running on the cluster and (b) queued jobs? 
Am I also correct in thinking that this information is not stored in the Slurm 
database? In other words, if the StateSave data is lost or corrupted, will all 
running/queued jobs be lost?

Any advice on the management and location of the StateSave directory in a 
dual-controller system would be much appreciated.

Best regards,
David
