Hi;
Please check the StateSaveLocation directory. It should be readable and
writable by both slurmctld nodes, and it should be a single shared
directory, not two separate local directories.
The explanation below is taken from the Slurm web site:
"The backup controller recovers state information from the
StateSaveLocation directory, which must be readable and writable from
both the primary and backup controllers."
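
A quick way to check this on each controller is sketched below. It is
only a rough sketch: the helper names are made up for illustration, and
the slurm.conf path in the usage note is an assumption that may differ
on your installation. It reads StateSaveLocation from the config text
and reports whether the directory is readable and writable by the user
running the script.

```python
import os
import re


def find_state_save_location(conf_text):
    """Return the StateSaveLocation value found in slurm.conf text, or None."""
    pattern = re.compile(r"StateSaveLocation\s*=\s*(\S+)", re.IGNORECASE)
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


def check_state_dir(path):
    """Report whether path is a directory the current user can read and write."""
    usable = os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)
    # st_dev identifies the device backing the path on this host only;
    # it does not by itself prove that two hosts see the same share.
    st_dev = os.stat(path).st_dev if os.path.exists(path) else None
    return {"path": path, "usable": usable, "st_dev": st_dev}


if __name__ == "__main__":
    # Assumed config path; adjust for your site.
    with open("/etc/slurm/slurm.conf") as f:
        location = find_state_save_location(f.read())
    print(check_state_dir(location) if location else "StateSaveLocation not set")
```

Run it as the SlurmUser on both slurmctld nodes. To confirm the
directory is really shared, create a file in it from one node and check
that it shows up on the other.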
Regards;
Ahmet M.
On 20.09.2021 12:08, Diego Zuccato wrote:
Hello all.
After summer break, I noticed that rebooting one of the two slurmctld
nodes kills & requeues all running jobs. Before the break it did not
impact running jobs and nobody changed config during the break... Duh?
Today I just restarted slurmctld and slurmd: same kill & requeue.
I'm currently in the process of adding some nodes, but I have already
done that before with no issues (actually, the second slurmctld node was
installed to catch the race of a job terminating while the main
slurmctld was shut down).
Anything I should double-check?
Tks.