Hi;

Please check the StateSaveLocation directory. It must be readable and writable by both slurmctld nodes, and it must be a single shared directory, not two separate local directories.

The explanation below is taken from the Slurm web site:

"The backup controller recovers state information from the StateSaveLocation directory, which must be readable and writable from both the primary and backup controllers."
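If it helps, here is a rough Python sketch of the access part of that check. It is only an illustration, not a Slurm tool: the helper names state_save_location and can_read_write are made up for this mail, and I am assuming the usual "Key=Value" layout of slurm.conf with "#" comments and case-insensitive keys.

```python
import tempfile


def state_save_location(conf_text):
    """Return the StateSaveLocation value found in slurm.conf text, or None."""
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Assumption: parameter names are matched case-insensitively.
        if key.strip().lower() == "statesavelocation":
            return value.strip()
    return None


def can_read_write(path):
    """Create a probe file in path, read it back, and let it be removed."""
    try:
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as fh:
            fh.write(b"ok")
            fh.flush()
            fh.seek(0)
            return fh.read() == b"ok"
    except OSError:
        # Missing directory, permission denied, read-only mount, etc.
        return False
```

You would run it on each slurmctld node as the user slurmctld runs as (root can write almost anywhere, so the result would be misleading). It only tests access on the node where it runs. To confirm the directory is really shared, create a file in it on one node and look for it on the other.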

Regards;

Ahmet M.



On 20.09.2021 at 12:08, Diego Zuccato wrote:
Hello all.

After the summer break, I noticed that rebooting one of the two slurmctld nodes kills & requeues all running jobs. Before the break it did not impact running jobs, and nobody changed the config during the break... Duh?

Today I just restarted slurmctld and slurmd: same kill&requeue.

I'm currently in the process of adding some nodes, but I have already done that before w/ no issues (actually, the second slurmctld node was installed to catch the race of a job terminating while the main slurmctld was shut down).

Anything I should double-check?

Tks.

