Tks. Checked it: it's on the home filesystem, NFS-shared between the nodes. Well, actually it's a bit more involved than that: JobCompLoc points to /var/spool/slurm/jobscompleted.txt, but /var/spool/slurm is actually a symlink to /home/conf/slurm_spool.

root@str957-cluster:/# grep spool /etc/slurm.conf
JobCompLoc=/var/spool/slurm/jobscompleted.txt
root@str957-cluster:/# ls -l /var/spool/
[...]
lrwxrwxrwx 1 root root 22 apr 16 08:12 slurm -> /home/conf/slurm_spool

The symlinks are on both nodes and the home is mounted.

When can the jobscompleted.txt file be removed? Maybe some weird character slipped in and is messing up the parsing? Can I test it?
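
One way to test for stray characters might be something like the sketch below. It only flags lines containing bytes outside printable ASCII and whitespace; it says nothing about how Slurm itself reads the file. The function name find_odd_lines is made up for illustration, and the -a option assumes GNU or BSD grep.

```shell
# Hypothetical helper: print every line (prefixed with its line number)
# that contains a byte outside printable ASCII and whitespace.
# LC_ALL=C makes grep work byte-wise; -a forces text mode even if the
# file contains control bytes.
find_odd_lines() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}

# Example (path taken from the JobCompLoc setting above):
#   find_odd_lines /var/spool/slurm/jobscompleted.txt
```

No output (and a non-zero exit status) means no such bytes were found. Legitimate non-ASCII text, such as accented job names, would be flagged too.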

On 20/09/2021 12:33, mercan wrote:
Hi;

Please check the StateSaveLocation directory, which should be readable and writable by both slurmctld nodes. It should be a single shared directory, not two local directories.

The explanation below is taken from the Slurm web site:

"The backup controller recovers state information from the StateSaveLocation directory, which must be readable and writable from both the primary and backup controllers."
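
A rough sketch of how this could be checked on each controller follows. The helper names conf_value and check_dir are made up for illustration, the config path /etc/slurm.conf is the one used elsewhere in this thread, and the key match is case-sensitive and ignores trailing comments, which is a simplification.

```shell
# Hypothetical helper: print the value of KEY from a slurm.conf-style
# file (KEY=value lines). Commented-out lines do not match; if the key
# appears more than once, the last occurrence is printed.
conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Hypothetical helper: report whether a directory exists and is
# readable and writable by the current user.
check_dir() {
    if [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ]; then
        echo "ok: $1"
    else
        echo "problem: $1"
        return 1
    fi
}

# Example (run on both controllers, as the user slurmctld runs as):
#   dir=$(conf_value StateSaveLocation /etc/slurm.conf)
#   check_dir "$dir" && df -P "$dir"
```

If the df -P output shows the same remote filesystem on both nodes, the directory is shared; a local device on either node would point to two separate local directories.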

Regards;

Ahmet M.



On 20.09.2021 12:08, Diego Zuccato wrote:
Hello all.

After summer break, I noticed that rebooting one of the two slurmctld nodes kills & requeues all running jobs. Before the break it did not impact running jobs and nobody changed config during the break... Duh?

Today I just restarted slurmctld and slurmd: same kill&requeue.

I'm currently in the process of adding some nodes, but I have already done that other times w/ no issues (actually, the second slurmctld node was installed to catch the race of a job terminating while the main slurmctld was shut down).

Anything I should double-check?

Tks.


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
