Hi there,

First of all, apologies for the rather verbose email.
Newbie here, wanting to set up a minimal Slurm cluster on Debian 12. I installed slurm-wlm (22.05.8) on the head node and slurmd (also 22.05.8) on the compute node via apt. I have one head node, one compute node, and one partition.

I have written the simplest of jobs (slurm_hello_world.sh):

#!/bin/env bash
#SBATCH --job-name=hello_word          # Job name
#SBATCH --output=hello_world_%j.log    # Standard output and error log

echo "Hello world, I am running on node $HOSTNAME"
sleep 5
date

which I submit with sbatch slurm_hello_world.sh. The jobs just sit in PENDING:

$ squeue --long -u $USER
Tue Nov 07 08:37:58 2023
JOBID PARTITION     NAME   USER   STATE  TIME TIME_LIMI NODES NODELIST(REASON)
    7 all_nodes hello_wo myuser PENDING  0:00 UNLIMITED     1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
    9 all_nodes hello_wo myuser PENDING  0:00 UNLIMITED     1 (ReqNodeNotAvail, UnavailableNodes:compute-0)

sinfo shows that the node is drained, even though the node is idle and not running anything:

$ sinfo --Node --long
Tue Nov 07 08:29:51 2023
NODELIST  NODES  PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute-0     1 all_nodes* drained   32 2:8:2  60000        0      1   (null) batch job complete f
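Since the REASON column above is truncated: am I right that the full drain reason is visible with scontrol, and that a drained node can be returned to service the same way? Something like:

$ scontrol show node compute-0                           # full, untruncated State/Reason (my assumption)
$ sudo scontrol update NodeName=compute-0 State=RESUME   # undrain the node (my assumption)

Or will the node simply drain again for the same reason?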
The slurm.conf (an exact copy on both head and compute nodes) is mostly commented-out defaults:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=mycluster
SlurmctldHost=head
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=compute-0 RealMemory=60000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=all_nodes Nodes=ALL Default=YES MaxTime=INFINITE State=UP
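One thing I do not know how to verify is whether the NodeName line above matches the hardware the node actually reports. If I understand correctly, running slurmd -C on the compute node prints the detected configuration (CPUs, sockets, cores, threads, RealMemory) in slurm.conf format, for comparison with the entry above:

compute-0:~$ slurmd -C    # prints a NodeName=... line with the hardware slurmd detects

Could a mismatch there (e.g. RealMemory in slurm.conf set higher than what slurmd detects) cause a drain like this?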
To my untrained eye, there is nothing obviously wrong in slurmd.log (on the compute node) or slurmctld.log (on the head node). In slurmctld.log:

[...SNIP...]
[2023-11-07T08:58:35.804] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:36.396] debug: sched/backfill: _attempt_backfill: beginning
[2023-11-07T08:58:36.396] debug: sched/backfill: _attempt_backfill: 4 jobs to backfill
[2023-11-07T08:58:36.652] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=1002
[2023-11-07T08:58:36.652] debug3: _set_hostname: Using auth hostname for alloc_node: head
[2023-11-07T08:58:36.652] debug3: JobDesc: user_id=1002 JobId=N/A partition=(null) name=hello_word
[2023-11-07T08:58:36.652] debug3:    cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2023-11-07T08:58:36.652] debug3:    Nodes=4294967294-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2023-11-07T08:58:36.652] debug3:    pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2023-11-07T08:58:36.653] debug3:    immediate=0 reservation=(null)
[2023-11-07T08:58:36.653] debug3:    features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2023-11-07T08:58:36.653] debug3:    req_nodes=(null) exc_nodes=(null)
[2023-11-07T08:58:36.653] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2023-11-07T08:58:36.653] debug3:    kill_on_node_fail=-1 script=#!/bin/env bash #SBATCH --job-name=hello...
[2023-11-07T08:58:36.653] debug3:    argv="/home/myuser/myuser-slurm/tests/hello_world_slurm.sh"
[2023-11-07T08:58:36.653] debug3:    environment=SHELL=/bin/bash,LANGUAGE=en_GB:en,EDITOR=vim,...
[2023-11-07T08:58:36.653] debug3:    stdin=/dev/null stdout=/home/myuser/myuser-slurm/tests/hello_world_%j.log stderr=(null)
[2023-11-07T08:58:36.653] debug3:    work_dir=/home/myuser/ansible-slurm/tests alloc_node:sid=head:721
[2023-11-07T08:58:36.653] debug3:    power_flags=
[2023-11-07T08:58:36.653] debug3:    resp_host=(null) alloc_resp_port=0 other_port=0
[2023-11-07T08:58:36.653] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2023-11-07T08:58:36.653] debug3:    mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2023-11-07T08:58:36.653] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2023-11-07T08:58:36.653] debug3:    end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2023-11-07T08:58:36.653] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2023-11-07T08:58:36.653] debug3:    mem_bind=0:(null) plane_size:65534
[2023-11-07T08:58:36.653] debug3:    array_inx=(null)
[2023-11-07T08:58:36.653] debug3:    burst_buffer=(null)
[2023-11-07T08:58:36.653] debug3:    mcs_label=(null)
[2023-11-07T08:58:36.653] debug3:    deadline=Unknown
[2023-11-07T08:58:36.653] debug3:    bitflags=0x1e000000 delay_boot=4294967294
[2023-11-07T08:58:36.654] debug2: found 1 usable nodes from config containing compute-0
[2023-11-07T08:58:36.654] debug3: _pick_best_nodes: JobId=11 idle_nodes 1 share_nodes 1
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug2: select/cons_tres: select_p_job_test: evaluating JobId=11
[2023-11-07T08:58:36.654] debug3: select_nodes: JobId=11 required nodes not avail
[2023-11-07T08:58:36.654] _slurm_rpc_submit_batch_job: JobId=11 InitPrio=4294901752 usec=822
[2023-11-07T08:58:38.807] debug: sched: Running job scheduler for default depth.
[2023-11-07T08:58:38.807] debug3: sched: JobId=7. State=PENDING. Reason=Resources. Priority=4294901756. Partition=all_nodes.
[2023-11-07T08:58:38.807] debug2: sched: JobId=8. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901755.
[2023-11-07T08:58:38.807] debug2: sched: JobId=9. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901754.
[2023-11-07T08:58:38.807] debug2: sched: JobId=10. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901753.
[2023-11-07T08:58:38.807] debug2: sched: JobId=11. unable to schedule in Partition=all_nodes (per _failed_partition()). Retaining previous scheduling Reason=ReqNodeNotAvail. Desc=ReqNodeNotAvail, UnavailableNodes:compute-0. Priority=4294901752.
[2023-11-07T08:58:39.008] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/job_state` as buf_t
[2023-11-07T08:58:39.008] debug3: Writing job id 12 to header record of job_state file

Can you help me figure out what is wrong with my setup, please?

Many thanks,

Jean-Paul Ebejer
University of Malta
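P.S. Happy to provide more output if it helps. I assume the usual head/compute sanity checks are a munge credential round-trip and a clock comparison, both of which I can run and post:

$ munge -n | ssh compute-0 unmunge   # munge round-trip test from the Slurm admin guide
$ date; ssh compute-0 date           # check for clock skew between head and compute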