slurmctld HA; backup controller doesn't schedule or start any job

Hi all,

I am trying out a slurmctld HA configuration on two servers, using Slurm version 22.05.9 on AlmaLinux 9.4.
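For reference, gateway1 (192.168.56.11) is the primary controller and gateway2 (192.168.56.12) is the backup. The failover test goes roughly like this (the systemd unit names below are what my packages install, and scontrol ping is only there to confirm which controller is answering):

    # on gateway1: stop the primary daemons
    [root@gateway1 ~]# systemctl stop slurmctld slurmdbd

    # from a login/compute node: confirm the backup has taken over
    $ scontrol ping          # should now report gateway1 down, gateway2 up

    # submit the test job against the backup controller
    $ sbatch ./twocore.sh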
The problem: after stopping the primary slurmctld and slurmdbd, if I submit a job with sbatch while the backup slurmctld and slurmdbd are running, the job stays pending with Reason=None and is never scheduled or started.

    $ squeue
        JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
           43 cpu_multi   job   hpc PD  0:00     1 (None)

Why won't my job start running? What should I change to get it to run? The backup's slurmctld.log and my configuration files are shown below.

Backup's slurmctld.log:

[2025-04-09T15:31:17.000] debug3: Heartbeat at 1744180276
[2025-04-09T15:31:18.000] debug3: Heartbeat at 1744180277
[2025-04-09T15:31:19.022] debug3: Heartbeat at 1744180279
[2025-04-09T15:31:19.605] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=1000
[2025-04-09T15:31:19.605] debug3: _set_hostname: Using auth hostname for alloc_node: compute1
[2025-04-09T15:31:19.605] debug3: JobDesc: user_id=1000 JobId=N/A partition=cpu_multi name=job
[2025-04-09T15:31:19.605] debug3:    cpus=2-4294967294 pn_min_cpus=1 core_spec=-1
[2025-04-09T15:31:19.605] debug3:    Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2025-04-09T15:31:19.605] debug3:    pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2025-04-09T15:31:19.605] debug3:    immediate=0 reservation=(null)
[2025-04-09T15:31:19.605] debug3:    features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2025-04-09T15:31:19.605] debug3:    req_nodes=(null) exc_nodes=(null)
[2025-04-09T15:31:19.605] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2025-04-09T15:31:19.605] debug3:    kill_on_node_fail=-1 script=#! /bin/bash #SBATCH -p cpu_multi #SBATC...
[2025-04-09T15:31:19.605] debug3:    argv="./twocore.sh"
[2025-04-09T15:31:19.605] debug3:    environment=SHELL=/bin/bash,PYENV_SHELL=bash,HISTCONTROL=ignoredups,...
[2025-04-09T15:31:19.605] debug3:    stdin=/dev/null stdout=/misc/home/hpc/slurmtest/twocore_%J.out stderr=(null)
[2025-04-09T15:31:19.605] debug3:    work_dir=/misc/home/hpc/slurmtest alloc_node:sid=compute1:281600
[2025-04-09T15:31:19.605] debug3:    power_flags=
[2025-04-09T15:31:19.605] debug3:    resp_host=(null) alloc_resp_port=0 other_port=0
[2025-04-09T15:31:19.605] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2025-04-09T15:31:19.605] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=2 open_mode=0 overcommit=-1 acctg_freq=(null)
[2025-04-09T15:31:19.605] debug3:    network=(null) begin=Unknown cpus_per_task=1 requeue=-1 licenses=(null)
[2025-04-09T15:31:19.605] debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2025-04-09T15:31:19.605] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2025-04-09T15:31:19.605] debug3:    mem_bind=0:(null) plane_size:65534
[2025-04-09T15:31:19.605] debug3:    array_inx=(null)
[2025-04-09T15:31:19.605] debug3:    burst_buffer=(null)
[2025-04-09T15:31:19.605] debug3:    mcs_label=(null)
[2025-04-09T15:31:19.605] debug3:    deadline=Unknown
[2025-04-09T15:31:19.605] debug3:    bitflags=0x1a00c000 delay_boot=4294967294
[2025-04-09T15:31:19.605] debug3: assoc_mgr_fill_in_user: found correct user: hpc(1000)
[2025-04-09T15:31:19.605] debug5: assoc_mgr_fill_in_assoc: looking for assoc of user=hpc(1000), acct=hpc, cluster=cluster, partition=cpu_multi
[2025-04-09T15:31:19.605] debug3: assoc_mgr_fill_in_assoc: found correct association of user=hpc(1000), acct=hpc, cluster=cluster, partition=cpu_multi to assoc=16 acct=hpc
[2025-04-09T15:31:19.605] debug3: found correct qos
[2025-04-09T15:31:19.607] debug2: priority/multifactor: priority_p_set: initial priority for job 44 is 33
[2025-04-09T15:31:19.607] debug2: found 1 usable nodes from config containing compute1
[2025-04-09T15:31:19.607] debug2: found 1 usable nodes from config containing compute2
[2025-04-09T15:31:19.607] debug3: _pick_best_nodes: JobId=44 idle_nodes 2 share_nodes 2
[2025-04-09T15:31:19.607] debug2: select/cons_tres: select_p_job_test: evaluating JobId=44
[2025-04-09T15:31:19.607] debug2: sched: JobId=44 allocated resources: NodeList=(null)
[2025-04-09T15:31:19.607] _slurm_rpc_submit_batch_job: JobId=44 InitPrio=33 usec=2490
[2025-04-09T15:31:19.608] debug3: create_mmap_buf: loaded file `/var/spool/slurm/ctld/job_state` as buf_t
[2025-04-09T15:31:19.609] debug3: Writing job id 45 to header record of job_state file
[2025-04-09T15:31:21.000] debug3: Heartbeat at 1744180280
[2025-04-09T15:31:21.257] debug2: _slurm_connect: failed to connect to 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:21.257] debug2: Error connecting slurm stream socket at 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:22.000] debug3: Heartbeat at 1744180282
[2025-04-09T15:31:24.000] debug3: Heartbeat at 1744180283
[2025-04-09T15:31:25.004] debug3: Heartbeat at 1744180285
[2025-04-09T15:31:27.001] debug3: Heartbeat at 1744180287
[2025-04-09T15:31:29.000] debug3: Heartbeat at 1744180288
[2025-04-09T15:31:30.000] debug3: Heartbeat at 1744180289
[2025-04-09T15:31:31.006] debug3: Heartbeat at 1744180291
[2025-04-09T15:31:32.822] debug2: _slurm_connect: failed to connect to 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:32.822] debug2: Error connecting slurm stream socket at 192.168.56.11:6817: Connection refused
[2025-04-09T15:31:33.007] debug3: Heartbeat at 1744180293
[2025-04-09T15:31:35.000] debug3: Heartbeat at 1744180294
[2025-04-09T15:31:36.002] debug3: Heartbeat at 1744180296
[2025-04-09T15:31:36.395] debug2: select/cons_tres: select_p_job_test: evaluating JobId=43
[2025-04-09T15:31:36.395] debug2: select/cons_tres: select_p_job_test: evaluating JobId=44
[2025-04-09T15:31:38.000] debug3: Heartbeat at 1744180297
[2025-04-09T15:31:38.497] debug2: Performing purge of old job records
[2025-04-09T15:31:39.000] debug3: Heartbeat at 1744180298
[2025-04-09T15:31:40.000] debug3: Heartbeat at 1744180300
[2025-04-09T15:31:40.655] debug2: Testing job time limits and checkpoints
[2025-04-09T15:31:42.000] debug3: Heartbeat at 1744180301
[2025-04-09T15:31:43.000] debug3: Heartbeat at 1744180302

slurm.conf:

ClusterName=cluster
SlurmctldHost=gateway1   #Primary(192.168.56.11)
SlurmctldHost=gateway2   #Backup(192.168.56.12)
MpiDefault=pmix
ProctrackType=proctrack/cgroup
PrologFlags=Contain
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskEpilog=/etc/slurm/taskepilog.sh
TaskPlugin=task/cgroup,task/affinity
TaskProlog=/etc/slurm/taskprolog.sh
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=10
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=32
SchedulerType=sched/builtin
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightPartition=1000
AccountingStorageHost=gateway1
AccountingStorageBackupHost=gateway2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=compute1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3900 Weight=1
NodeName=compute2 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3900 Weight=1
PartitionName=gpu_single Nodes=ALL PriorityJobFactor=30 MaxTime=INFINITE State=UP Default=YES
PartitionName=cpu_single Nodes=ALL PriorityJobFactor=10 MaxTime=INFINITE State=UP
PartitionName=cpu_multi Nodes=ALL MaxTime=INFINITE State=UP

slurmdbd.conf:

AuthType=auth/munge
DebugLevel=4
DbdHost=gateway1
DbdBackupHost=gateway2
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
PurgeEventAfter=1month
PurgeJobAfter=1month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=1month
PurgeUsageAfter=1month
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=gateway1
StoragePass=mypassword
StorageUser=slurm

Best regards,
Hiro
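P.S. One thing I am not certain about: the backup's log above shows it loading and rewriting /var/spool/slurm/ctld/job_state locally, and my understanding of the Slurm HA documentation is that StateSaveLocation must be a single directory visible to both controllers (e.g. on NFS). I have not yet confirmed whether /var/spool/slurm/ctld is actually shared between gateway1 and gateway2 or is a separate local directory on each. If it has to be shared, I assume the relevant part of slurm.conf would look something like this (the shared path below is only a hypothetical example):

    SlurmctldHost=gateway1   #Primary(192.168.56.11)
    SlurmctldHost=gateway2   #Backup(192.168.56.12)
    # one directory on storage mounted by both gateways, not a local path:
    StateSaveLocation=/misc/slurm/ctld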