Hi there,

First of all, apologies for the rather verbose email.
Newbie here, wanting to set up a minimal Slurm cluster on Debian 12. I installed slurm-wlm (22.05.8) on the head node and slurmd (also 22.05.8) on the compute node via apt. I have one head node, one compute node, and one partition.

I have written the simplest of jobs (slurm_hello_world.sh):

#!/bin/env bash
#SBATCH --job-name=hello_word          # Job name
#SBATCH --output=hello_world_%j.log    # Standard output and error log

echo "Hello world, I am running on node $HOSTNAME"
sleep 5
date

which I submit with sbatch slurm_hello_world.sh. The jobs just sit in PENDING:

$ squeue --long -u $USER
Tue Nov 07 08:37:58 2023
JOBID PARTITION     NAME   USER   STATE  TIME TIME_LIMI NODES NODELIST(REASON)
    7 all_nodes hello_wo myuser PENDING  0:00 UNLIMITED     1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
    9 all_nodes hello_wo myuser PENDING  0:00 UNLIMITED     1 (ReqNodeNotAvail, UnavailableNodes:compute-0)

sinfo shows that the node is drained, even though the node is idle and not running anything:

$ sinfo --Node --long
Tue Nov 07 08:29:51 2023
NODELIST  NODES  PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute-0     1 all_nodes* drained   32 2:8:2  60000        0      1   (null) batch job complete f
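Since the REASON column above is truncated: am I right that the full drain reason is visible with scontrol, and that a drained node can be returned to service the same way? Something like:

$ scontrol show node compute-0                           # full, untruncated State/Reason (my assumption)
$ sudo scontrol update NodeName=compute-0 State=RESUME   # undrain the node (my assumption)

Or will the node simply drain again for the same reason?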
The slurm.conf (an exact copy on both head and compute nodes) is mostly commented-out defaults:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=mycluster
SlurmctldHost=head
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=compute-0 RealMemory=60000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=all_nodes Nodes=ALL Default=YES MaxTime=INFINITE State=UP
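One thing I do not know how to verify is whether the NodeName line above matches the hardware the node actually reports. If I understand correctly, running slurmd -C on the compute node prints the detected configuration (CPUs, sockets, cores, threads, RealMemory) in slurm.conf format, for comparison with the entry above:

compute-0:~$ slurmd -C    # prints a NodeName=... line with the hardware slurmd detects

Could a mismatch there (e.g. RealMemory in slurm.conf set higher than what slurmd detects) cause a drain like this?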
To my untrained eye, there is nothing obviously wrong in slurmd.log (on the compute node) or slurmctld.log (on the head node). In slurmctld.log:

[...SNIP...]
[2023-11-07T08:58:35.804] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:36.396] debug: sched/backfill: _attempt_backfill: beginning
[2023-11-07T08:58:36.396] debug: sched/backfill: _attempt_backfill: 4 jobs to backfill
[2023-11-07T08:58:36.652] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=1002
[2023-11-07T08:58:36.652] debug3: _set_hostname: Using auth hostname for alloc_node: head
[2023-11-07T08:58:36.652] debug3: JobDesc: user_id=1002 JobId=N/A partition=(null) name=hello_word
[2023-11-07T08:58:36.652] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2023-11-07T08:58:36.652] debug3:    Nodes=4294967294-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2023-11-07T08:58:36.652] debug3:    pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2023-11-07T08:58:36.653] debug3:    immediate=0 reservation=(null)
[2023-11-07T08:58:36.653] debug3:    features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2023-11-07T08:58:36.653] debug3:    req_nodes=(null) exc_nodes=(null)
[2023-11-07T08:58:36.653] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2023-11-07T08:58:36.653] debug3:    kill_on_node_fail=-1 script=#!/bin/env bash #SBATCH --job-name=hello...
[2023-11-07T08:58:36.653] debug3:    argv="/home/myuser/myuser-slurm/tests/hello_world_slurm.sh"
[2023-11-07T08:58:36.653] debug3:    environment=SHELL=/bin/bash,LANGUAGE=en_GB:en,EDITOR=vim,...
[2023-11-07T08:58:36.653] debug3:    stdin=/dev/null stdout=/home/myuser/myuser-slurm/tests/hello_world_%j.log stderr=(null)
[2023-11-07T08:58:36.653] debug3:    work_dir=/home/myuser/ansible-slurm/tests alloc_node:sid=head:721
[2023-11-07T08:58:36.653] debug3:    power_flags=
[2023-11-07T08:58:36.653] debug3:    resp_host=(null) alloc_resp_port=0 other_port=0
[2023-11-07T08:58:36.653] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2023-11-07T08:58:36.653] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2023-11-07T08:58:36.653] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2023-11-07T08:58:36.653] debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2023-11-07T08:58:36.653] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2023-11-07T08:58:36.653] debug3:    mem_bind=0:(null) plane_size:65534
[2023-11-07T08:58:36.653] debug3:    array_inx=(null)
[2023-11-07T08:58:36.653] debug3:    burst_buffer=(null)
[2023-11-07T08:58:36.653] debug3:    mcs_label=(null)
[2023-11-07T08:58:36.653] debug3:    deadline=Unknown
[2023-11-07T08:58:36.653] debug3:    bitflags=0x1e000000 delay_boot=4294967294
[2023-11-07T08:58:36.654] debug2: found 1 usable nodes from config containing compute-0
[2023-11-07T08:58:36.654] debug3: _pick_best_nodes: JobId=11 idle_nodes 1 share_nodes 1
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug3: select_nodes: JobId=11 required nodes not avail
[2023-11-07T08:58:36.654] _slurm_rpc_submit_batch_job: JobId=11 InitPrio=4294901752 usec=822
[2023-11-07T08:58:38.807] debug: sched: Running job scheduler for default depth.
[2023-11-07T08:58:38.807] debug3: sched: JobId=7. State=PENDING. Reason=Resources. Priority=4294901756. Partition=all_nodes.
[2023-11-07T08:58:38.807] debug2: sched: JobId=8. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901755.
[2023-11-07T08:58:38.807] debug2: sched: JobId=9. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901754.
[2023-11-07T08:58:38.807] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:38.807] debug2: sched: JobId=11. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901752.
[2023-11-07T08:58:39.008] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/job_state` as buf_t
[2023-11-07T08:58:39.008] debug3: Writing job id 12 to header record of job_state file

Can you help me figure out what is wrong with my setup, please?

Many thanks,

Jean-Paul Ebejer
University of Malta
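P.S. Happy to provide more output if it helps. I assume the usual head/compute sanity checks are a munge credential round-trip and a clock comparison, both of which I can run and post:

$ munge -n | ssh compute-0 unmunge   # munge round-trip test from the Slurm admin guide
$ date; ssh compute-0 date           # check for clock skew between head and compute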