Offhand, I would suggest double-checking munge and the versions of slurmd/slurmctld.
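For example, something along these lines (hostnames are only illustrative) will show whether a munge credential from one controller is accepted on the other and whether the daemon versions match:

  # run on slurm1; the credential should decode cleanly on slurm2
  munge -n | ssh slurm2 unmunge

  # compare versions on both controllers and on a compute node
  slurmctld -V
  slurmd -V
  sinfo -V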
Brian Andrus
On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
Our cluster is set up with 2 slurm control nodes, and "scontrol show config" reports the following:
> scontrol show config
…
SlurmctldHost[0] = slurm1
SlurmctldHost[1] = slurm2
StateSaveLocation = /etc/slurm/state
…
Of course we have made sure that both nodes have the same slurm.conf, mount the same NFS share at StateSaveLocation, and can read/write it. (Their operating systems are different, though: slurm1 is CentOS 7 and slurm2 is CentOS 8.)
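For example, checks roughly like the following can be used to confirm this:

  # same config on both controllers
  md5sum /etc/slurm/slurm.conf     # run on slurm1 and slurm2, compare hashes

  # StateSaveLocation is writable from the backup controller
  touch /etc/slurm/state/nfs_test && rm /etc/slurm/state/nfs_test   # run on slurm2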
When slurm1 controls the cluster and slurm2 works in standby mode, the cluster has no problems.
But when we use "scontrol takeover" on slurm2 to switch the primary role, we find that all newly submitted jobs get stuck in the PD state.
No job is allocated resources by slurm2, no matter how long we wait. Meanwhile, old running jobs complete without problems, and query commands like "sinfo" and "sacct" all work well.
The pending reason is first shown as "Priority" in squeue; after we manually update the priority, the reason becomes "None" but the jobs still stay stuck in the PD state.
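(The manual priority update was done roughly like this, where <jobid> is one of the stuck jobs:

  scontrol update JobId=<jobid> Priority=100000
)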
While slurm2 is the primary, there are no significant errors in slurmctld.log. Only after we restart the slurm1 service so that slurm2 returns to the standby role does it report lots of errors such as:
error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in
standby mode
error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode
error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in
standby mode
So, is there any suggestion on how to find out why slurm2 works abnormally as the primary controller?