Offhand, I would suggest double-checking munge and the versions of slurmd/slurmctld.
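For example, something along these lines (hostnames are only illustrative) will show whether a munge credential from one controller is accepted on the other and whether the daemon versions match:

  # run on slurm1; the credential should decode cleanly on slurm2
  munge -n | ssh slurm2 unmunge

  # compare versions on both controllers and on a compute node
  slurmctld -V
  slurmd -V
  sinfo -V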
Brian Andrus
On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
Our cluster is set up with 2 slurm control nodes, and "scontrol show config" reports the following:
> scontrol show config
…
SlurmctldHost[0] = slurm1
SlurmctldHost[1] = slurm2
StateSaveLocation = /etc/slurm/state
…
Of course we have made sure that both nodes have the same slurm.conf, mount the same NFS share at StateSaveLocation, and can read/write it. (Their operating systems are different, though: slurm1 is CentOS 7 and slurm2 is CentOS 8.)
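For example, checks roughly like the following can be used to confirm this:

  # same config on both controllers
  md5sum /etc/slurm/slurm.conf     # run on slurm1 and slurm2, compare hashes

  # StateSaveLocation is writable from the backup controller
  touch /etc/slurm/state/nfs_test && rm /etc/slurm/state/nfs_test   # run on slurm2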
When slurm1 controls the cluster and slurm2 works in standby mode, the cluster has no problems.
But when we use "scontrol takeover" on slurm2 to switch the primary role, we find that all newly submitted jobs get stuck in the PD state.
No job is allocated resources by slurm2, no matter how long we wait. Meanwhile, old running jobs complete without problems, and query commands like "sinfo" and "sacct" all work well.
The pending reason is first shown as "Priority" in squeue; after we manually update the priority, the reason becomes "None" but the jobs still stay stuck in the PD state.
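(The manual priority update was done roughly like this, where <jobid> is one of the stuck jobs:

  scontrol update JobId=<jobid> Priority=100000
)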
While slurm2 is the primary, there are no significant errors in slurmctld.log. Only after we restart the slurm1 service so that slurm2 returns to the standby role does it report lots of errors such as:
error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in
standby mode
error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode
error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in
standby mode
So, is there any suggestion on how to find out why slurm2 works abnormally as the primary controller?