Hello, I'm a bit lost trying to get a single EC2 node to work. Would appreciate your help!
I have a single AWS EC2 CLOUD node, named slurm-node0. The instance is currently stopped:

# aws-ec2-list-instances | grep slurm-node0
i-006506267531a0511  slurm-node0  stopped

When I request an allocation with salloc, I expect the ResumeProgram to be called, but it is not. I assume it's related to the various Time and Timeout values ...

Some info:

# slurmctld -V
slurm 17.11.3-2

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cloud*    up     infinite       1  down*  slurm-node0

# sinfo -Nle
Tue Mar 13 15:51:06 2018
NODELIST     NODES PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
slurm-node0      1 cloud*     down*     1  1:1:1       1         0       1  cloud     Not responding

My config file:

ClusterName=slurm-aws
ControlMachine=gsi-db-srv
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
ReturnToService=1
TreeWidth=128
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin
SelectType=select/linear
FastSchedule=1
PriorityType=priority/basic
SlurmctldDebug=10
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=10
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
ResumeTimeout=360
SuspendTimeout=300
SuspendTime=600
SuspendProgram=/usr/share/slurm-gsi/bin/suspend_program
ResumeProgram=/usr/share/slurm-gsi/bin/resume_program

NodeName=slurm-node0 State=CLOUD Feature=cloud Weight=1
PartitionName=cloud Nodes=slurm-node0 Default=yes State=up
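In case the resume script itself matters: it boils down to something like the sketch below (simplified; the hard-coded instance id and region are placeholders, not the real code). It just starts the stopped instance and then hands the node's address back to slurmctld. Since there is only one node, I don't bother expanding the hostlist:

#!/bin/bash
# Sketch of a resume script: slurmctld passes the node list as $1 (here a single node).
NODE="$1"
INSTANCE_ID="i-006506267531a0511"   # placeholder: normally looked up from $NODE
REGION="eu-central-1"               # placeholder

# Start the stopped instance and wait until it is running
aws ec2 start-instances --region "$REGION" --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-running --region "$REGION" --instance-ids "$INSTANCE_ID"

# Tell slurmctld the node's (possibly new) private IP
IP=$(aws ec2 describe-instances --region "$REGION" --instance-ids "$INSTANCE_ID" \
       --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
scontrol update NodeName="$NODE" NodeAddr="$IP" NodeHostname="$NODE"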
Some of my slurmctld.log:

[2018-03-13T15:38:16.386] debug3: Writing job id 35 to header record of job_state file
[2018-03-13T15:38:16.556] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2018-03-13T15:38:16.556] debug3: JobDesc: user_id=0 job_id=N/A partition=(null) name=bash
[2018-03-13T15:38:16.556] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2018-03-13T15:38:16.556] debug3: Nodes=1-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2018-03-13T15:38:16.556] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2018-03-13T15:38:16.556] debug3: immediate=0 reservation=(null)
[2018-03-13T15:38:16.556] debug3: features=(null) cluster_features=(null)
[2018-03-13T15:38:16.556] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2018-03-13T15:38:16.556] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2018-03-13T15:38:16.556] debug3: kill_on_node_fail=-1 script=(null)
[2018-03-13T15:38:16.556] debug3: stdin=(null) stdout=(null) stderr=(null)
[2018-03-13T15:38:16.556] debug3: work_dir=/tmp alloc_node:sid=gsi-db-srv:30214
[2018-03-13T15:38:16.556] debug3: power_flags=
[2018-03-13T15:38:16.556] debug3: resp_host=172.31.38.230 alloc_resp_port=41716 other_port=46570
[2018-03-13T15:38:16.556] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2018-03-13T15:38:16.556] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2018-03-13T15:38:16.556] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2018-03-13T15:38:16.556] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2018-03-13T15:38:16.556] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2018-03-13T15:38:16.556] debug3: mem_bind=65534:(null) plane_size:65534
[2018-03-13T15:38:16.556] debug3: array_inx=(null)
[2018-03-13T15:38:16.556] debug3: burst_buffer=(null)
[2018-03-13T15:38:16.556] debug3: mcs_label=(null)
[2018-03-13T15:38:16.556] debug3: deadline=Unknown
[2018-03-13T15:38:16.556] debug3: bitflags=0 delay_boot=4294967294
[2018-03-13T15:38:16.557] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2018-03-13T15:38:16.557] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2018-03-13T15:38:16.557] debug2: found 1 usable nodes from config containing slurm-node0
[2018-03-13T15:38:16.557] debug3: _pick_best_nodes: job 36 idle_nodes 1 share_nodes 1
[2018-03-13T15:38:16.557] debug5: powercapping: checking job 36 : skipped, not eligible
[2018-03-13T15:38:16.557] debug3: JobId=36 required nodes not avail
[2018-03-13T15:38:16.557] sched: _slurm_rpc_allocate_resources JobId=36 NodeList=(null) usec=521
[2018-03-13T15:38:17.398] debug2: Testing job time limits and checkpoints
[2018-03-13T15:38:19.401] debug: Spawning registration agent for slurm-node0 1 hosts
[2018-03-13T15:38:19.401] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2018-03-13T15:38:19.401] debug: sched: Running job scheduler
[2018-03-13T15:38:19.401] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:38:19.401] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:38:19.401] debug2: got 1 threads to send out
[2018-03-13T15:38:19.401] debug2: Tree head got back 0 looking for 1
[2018-03-13T15:38:19.401] debug3: Tree sending to slurm-node0
[2018-03-13T15:38:21.004] debug3: Writing job id 36 to header record of job_state file
[2018-03-13T15:38:21.401] debug2: slurm_connect poll timeout: Connection timed out
[2018-03-13T15:38:21.401] debug2: Error connecting slurm stream socket at 172.31.38.99:6818: Connection timed out
[2018-03-13T15:38:21.402] debug3: problems with slurm-node0
[2018-03-13T15:38:21.402] debug2: Tree head got back 1
[2018-03-13T15:38:47.434] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:16.469] debug2: Performing purge of old job records
[2018-03-13T15:39:16.469] debug: sched: Running job scheduler
[2018-03-13T15:39:16.469] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:39:16.469] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:39:17.470] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:47.506] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:59.520] debug: Spawning registration agent for slurm-node0 1 hosts
[2018-03-13T15:39:59.520] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2018-03-13T15:39:59.521] debug2: got 1 threads to send out
[2018-03-13T15:39:59.521] debug2: Tree head got back 0 looking for 1
[2018-03-13T15:39:59.521] debug3: Tree sending to slurm-node0
[2018-03-13T15:40:01.521] debug2: slurm_connect poll timeout: Connection timed out
[2018-03-13T15:40:01.521] debug2: Error connecting slurm stream socket at 172.31.38.99:6818: Connection timed out
[2018-03-13T15:40:01.521] debug3: problems with slurm-node0
[2018-03-13T15:40:01.521] debug2: Tree head got back 1
[2018-03-13T15:40:09.788] debug3: Processing RPC: REQUEST_NODE_INFO from uid=0
[2018-03-13T15:40:09.790] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2018-03-13T15:40:09.790] debug2: _slurm_rpc_dump_partitions, size=184 usec=46
[2018-03-13T15:40:14.539] debug4: No backup slurmctld to ping
[2018-03-13T15:40:16.541] debug2: Performing purge of old job records
[2018-03-13T15:40:16.541] debug: sched: Running job scheduler
[2018-03-13T15:40:16.542] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:40:16.542] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:40:17.543] debug2: Testing job time limits and checkpoints
[2018-03-13T15:40:47.580] debug2: Testing job time limits and checkpoints
[2018-03-13T15:41:16.615] debug2: Performing purge of old job records

Many thanks,
- Arie
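P.S. For completeness, the suspend side is the mirror image, roughly along these lines (again a simplified sketch; instance id and region are placeholders, not the real code):

#!/bin/bash
# Sketch of a suspend script: slurmctld passes the node list as $1 (here a single node).
NODE="$1"
INSTANCE_ID="i-006506267531a0511"   # placeholder: normally looked up from $NODE
REGION="eu-central-1"               # placeholder

# Stop the EC2 instance backing the node
aws ec2 stop-instances --region "$REGION" --instance-ids "$INSTANCE_ID"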