Tks. Checked it: it's on the home filesystem, NFS-shared between the
nodes. Well, it's actually a bit more involved than that: JobCompLoc
points to /var/spool/slurm/jobscompleted.txt, but /var/spool/slurm is
actually a symlink to /home/conf/slurm_spool.
root@str957-cluster:/# grep spool /etc/slurm.conf
JobCompLoc=/var/spool/slurm/jobscompleted.txt
root@str957-cluster:/# ls -l /var/spool/
[...]
lrwxrwxrwx 1 root root 22 apr 16 08:12 slurm -> /home/conf/slurm_spool
The symlinks are on both nodes and the home is mounted.
When can the jobscompleted.txt file be removed? Maybe some weird
character slipped in and is messing up the parsing? Can I test it?
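A rough way to look for stray bytes, as a sketch only: it flags lines
containing anything that is neither printable ASCII nor whitespace, and
says nothing about whether Slurm itself can parse the file. The function
name is hypothetical:

```shell
# Hypothetical helper: list lines of a file that contain bytes which are
# neither printable ASCII nor whitespace (control characters, non-ASCII).
# LC_ALL=C makes grep work byte by byte, so [:print:] means ASCII only.
find_odd_bytes() {
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
}
```

Calling `find_odd_bytes /var/spool/slurm/jobscompleted.txt` prints each
suspicious line prefixed with its line number, and returns non-zero when
nothing suspicious is found.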
On 20/09/2021 12:33, mercan wrote:
Hi;
Please check the StateSaveLocation directory, which should be readable
and writable by both slurmctld nodes. It should be a single shared
directory, not two local directories.
The explanation below is taken from the Slurm website:
"The backup controller recovers state information from the
StateSaveLocation directory, which must be readable and writable from
both the primary and backup controllers."
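A minimal sketch of that check, to be run on both controllers. The two
function names are made up for illustration, the default path is the
/etc/slurm.conf used earlier in this thread, and it assumes the key is
spelled StateSaveLocation exactly as in the documentation:

```shell
# Hypothetical helper: print the StateSaveLocation set in a slurm.conf.
# Commented-out lines are skipped; if the key appears more than once,
# the last occurrence is printed.
state_save_location() {
    conf="${1:-/etc/slurm.conf}"
    sed -n 's/^[[:space:]]*StateSaveLocation[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1
}

# Hypothetical helper: try to create and remove a probe file in the
# directory, to see whether the current user can write there.
check_writable() {
    dir="$1"
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "OK: $dir"
    else
        echo "FAIL: $dir"
        return 1
    fi
}
```

On each slurmctld node, something like
`check_writable "$(state_save_location)"`, run as the user slurmctld
runs as (SlurmUser in slurm.conf), should print OK. Comparing the
`df -P` output for that directory on the two nodes should then show the
same NFS export rather than two local filesystems.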
Regards;
Ahmet M.
On 20.09.2021 12:08, Diego Zuccato wrote:
Hello all.
After summer break, I noticed that rebooting one of the two slurmctld
nodes kills & requeues all running jobs. Before the break it did not
impact running jobs and nobody changed config during the break... Duh?
Today I just restarted slurmctld and slurmd: same kill&requeue.
I'm currently in the process of adding some nodes, but I have already
done that other times with no issues (actually, the second slurmctld
node was installed to catch the race of a job terminating while the
main slurmctld was shut down).
Anything I should double-check?
Tks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786