I built and ran a quick test on an older Slurm version and do see the issue. It
looks like a possible bug; I would open a bug with SchedMD.
I couldn't think of a good work-around, since the job would get
rescheduled to a different node if you reboot, even if you have the node
update its own status at boot.
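(For reference, the "node updates its own status at boot" idea would look roughly
like the sketch below; running it from a boot script after slurmd starts is only
an assumption on my part, not something from this thread:)

    # hypothetical boot-time script on the compute node, run after slurmd is up
    scontrol update NodeName=$(hostname -s) State=RESUME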
> Ah. Looks like the --reboot option is telling slurmctld to put them in
> the CF state and wait for them to come back up. Slurmctld then waits for
> them to 'disconnect' and come back. Since they never reboot (therefore
> never disconnect), slurmctld keeps them in the CF state until the timeout
> occurs.
Ah. Looks like the --reboot option is telling slurmctld to put them in
the CF state and wait for them to come back up. Slurmctld then waits for
them to 'disconnect' and come back. Since they never reboot (therefore
never disconnect), slurmctld keeps them in the CF state until the
timeout occurs.
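(One quick sanity check, as a sketch only: confirm what the controller thinks the
reboot and timeout settings are. RebootProgram, ResumeTimeout and ReturnToService
are standard slurm.conf parameters; whether any of them is at fault here is just
a guess to rule out.)

    # on the controller
    scontrol show config | grep -Ei 'RebootProgram|ResumeTimeout|ReturnToService'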
Hi Brian,
The nodes work with Slurm without any issues until I try the "--reboot"
option.
I can successfully allocate the nodes and run any other Slurm-related operation.
> You may want to double check that the node is actually rebooting and
> that slurmd is set to start on boot.
That's the problem, they are not rebooting.
You may want to double check that the node is actually rebooting and
that slurmd is set to start on boot.
ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to
slurmctld.
Are you able to log onto the node itself and see that it has rebooted?
If so, try doing something like 'sinfo'.
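(A sketch of those checks, assuming a systemd-managed slurmd and a hypothetical
node name node01:)

    ssh node01 uptime                       # did the node actually go down and come back?
    ssh node01 systemctl is-enabled slurmd  # will slurmd start on boot?
    ssh node01 systemctl status slurmd      # is slurmd running right now?
    sinfo -n node01                         # what state does slurmctld report?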
Hi all
I'm trying to use the --reboot option of srun to reboot the nodes before
allocation.
However, the nodes are not being rebooted.
The node gets stuck in the allocated# state as shown by sinfo, or CF as shown
by squeue.
The logs of slurmctld and slurmd show no relevant information, with debug levels
set to "debug".