Hello, I'm a bit lost trying to get a single EC2 node to work. Would appreciate your help!
I have a single AWS EC2 CLOUD node, named slurm-node0. The instance is currently stopped:

# aws-ec2-list-instances | grep slurm-node0
i-006506267531a0511  slurm-node0  stopped

When I request an allocation with salloc, I expect the ResumeProgram to be called, but it is not. I assume it's related to the various Time and Timeout values ...

Some info:

# slurmctld -V
slurm 17.11.3-2

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cloud*    up     infinite       1  down*  slurm-node0

# sinfo -Nle
Tue Mar 13 15:51:06 2018
NODELIST     NODES PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
slurm-node0      1 cloud*     down*     1  1:1:1       1         0       1  cloud     Not responding

My config file:

ClusterName=slurm-aws
ControlMachine=gsi-db-srv
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
ReturnToService=1
TreeWidth=128
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/builtin
SelectType=select/linear
FastSchedule=1
PriorityType=priority/basic
SlurmctldDebug=10
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=10
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
ResumeTimeout=360
SuspendTimeout=300
SuspendTime=600
SuspendProgram=/usr/share/slurm-gsi/bin/suspend_program
ResumeProgram=/usr/share/slurm-gsi/bin/resume_program

NodeName=slurm-node0 State=CLOUD Feature=cloud Weight=1
PartitionName=cloud Nodes=slurm-node0 Default=yes State=up
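In case the resume script itself matters: it boils down to something like the sketch below (simplified; the hard-coded instance id and region are placeholders, not the real code). It just starts the stopped instance and then hands the node's address back to slurmctld. Since there is only one node, I don't bother expanding the hostlist:

#!/bin/bash
# Sketch of a resume script: slurmctld passes the node list as $1 (here a single node).
NODE="$1"
INSTANCE_ID="i-006506267531a0511"   # placeholder: normally looked up from $NODE
REGION="eu-central-1"               # placeholder

# Start the stopped instance and wait until it is running
aws ec2 start-instances --region "$REGION" --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-running --region "$REGION" --instance-ids "$INSTANCE_ID"

# Tell slurmctld the node's (possibly new) private IP
IP=$(aws ec2 describe-instances --region "$REGION" --instance-ids "$INSTANCE_ID" \
       --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
scontrol update NodeName="$NODE" NodeAddr="$IP" NodeHostname="$NODE"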
Some of my slurmctld.log:

[2018-03-13T15:38:16.386] debug3: Writing job id 35 to header record of job_state file
[2018-03-13T15:38:16.556] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2018-03-13T15:38:16.556] debug3: JobDesc: user_id=0 job_id=N/A partition=(null) name=bash
[2018-03-13T15:38:16.556] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2018-03-13T15:38:16.556] debug3: Nodes=1-[4294967294] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2018-03-13T15:38:16.556] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2018-03-13T15:38:16.556] debug3: immediate=0 reservation=(null)
[2018-03-13T15:38:16.556] debug3: features=(null) cluster_features=(null)
[2018-03-13T15:38:16.556] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2018-03-13T15:38:16.556] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2018-03-13T15:38:16.556] debug3: kill_on_node_fail=-1 script=(null)
[2018-03-13T15:38:16.556] debug3: stdin=(null) stdout=(null) stderr=(null)
[2018-03-13T15:38:16.556] debug3: work_dir=/tmp alloc_node:sid=gsi-db-srv:30214
[2018-03-13T15:38:16.556] debug3: power_flags=
[2018-03-13T15:38:16.556] debug3: resp_host=172.31.38.230 alloc_resp_port=41716 other_port=46570
[2018-03-13T15:38:16.556] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2018-03-13T15:38:16.556] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2018-03-13T15:38:16.556] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2018-03-13T15:38:16.556] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2018-03-13T15:38:16.556] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2018-03-13T15:38:16.556] debug3: mem_bind=65534:(null) plane_size:65534
[2018-03-13T15:38:16.556] debug3: array_inx=(null)
[2018-03-13T15:38:16.556] debug3: burst_buffer=(null)
[2018-03-13T15:38:16.556] debug3: mcs_label=(null)
[2018-03-13T15:38:16.556] debug3: deadline=Unknown
[2018-03-13T15:38:16.556] debug3: bitflags=0 delay_boot=4294967294
[2018-03-13T15:38:16.557] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2018-03-13T15:38:16.557] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
[2018-03-13T15:38:16.557] debug2: found 1 usable nodes from config containing slurm-node0
[2018-03-13T15:38:16.557] debug3: _pick_best_nodes: job 36 idle_nodes 1 share_nodes 1
[2018-03-13T15:38:16.557] debug5: powercapping: checking job 36 : skipped, not eligible
[2018-03-13T15:38:16.557] debug3: JobId=36 required nodes not avail
[2018-03-13T15:38:16.557] sched: _slurm_rpc_allocate_resources JobId=36 NodeList=(null) usec=521
[2018-03-13T15:38:17.398] debug2: Testing job time limits and checkpoints
[2018-03-13T15:38:19.401] debug: Spawning registration agent for slurm-node0 1 hosts
[2018-03-13T15:38:19.401] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2018-03-13T15:38:19.401] debug: sched: Running job scheduler
[2018-03-13T15:38:19.401] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:38:19.401] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:38:19.401] debug2: got 1 threads to send out
[2018-03-13T15:38:19.401] debug2: Tree head got back 0 looking for 1
[2018-03-13T15:38:19.401] debug3: Tree sending to slurm-node0
[2018-03-13T15:38:21.004] debug3: Writing job id 36 to header record of job_state file
[2018-03-13T15:38:21.401] debug2: slurm_connect poll timeout: Connection timed out
[2018-03-13T15:38:21.401] debug2: Error connecting slurm stream socket at 172.31.38.99:6818: Connection timed out
[2018-03-13T15:38:21.402] debug3: problems with slurm-node0
[2018-03-13T15:38:21.402] debug2: Tree head got back 1
[2018-03-13T15:38:47.434] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:16.469] debug2: Performing purge of old job records
[2018-03-13T15:39:16.469] debug: sched: Running job scheduler
[2018-03-13T15:39:16.469] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:39:16.469] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:39:17.470] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:47.506] debug2: Testing job time limits and checkpoints
[2018-03-13T15:39:59.520] debug: Spawning registration agent for slurm-node0 1 hosts
[2018-03-13T15:39:59.520] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2018-03-13T15:39:59.521] debug2: got 1 threads to send out
[2018-03-13T15:39:59.521] debug2: Tree head got back 0 looking for 1
[2018-03-13T15:39:59.521] debug3: Tree sending to slurm-node0
[2018-03-13T15:40:01.521] debug2: slurm_connect poll timeout: Connection timed out
[2018-03-13T15:40:01.521] debug2: Error connecting slurm stream socket at 172.31.38.99:6818: Connection timed out
[2018-03-13T15:40:01.521] debug3: problems with slurm-node0
[2018-03-13T15:40:01.521] debug2: Tree head got back 1
[2018-03-13T15:40:09.788] debug3: Processing RPC: REQUEST_NODE_INFO from uid=0
[2018-03-13T15:40:09.790] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2018-03-13T15:40:09.790] debug2: _slurm_rpc_dump_partitions, size=184 usec=46
[2018-03-13T15:40:14.539] debug4: No backup slurmctld to ping
[2018-03-13T15:40:16.541] debug2: Performing purge of old job records
[2018-03-13T15:40:16.541] debug: sched: Running job scheduler
[2018-03-13T15:40:16.542] debug3: sched: JobId=35. State=PENDING. Reason=Resources. Priority=4294901754. Partition=cloud.
[2018-03-13T15:40:16.542] debug3: sched: JobId=36. State=PENDING. Reason=Resources. Priority=4294901753. Partition=cloud.
[2018-03-13T15:40:17.543] debug2: Testing job time limits and checkpoints
[2018-03-13T15:40:47.580] debug2: Testing job time limits and checkpoints
[2018-03-13T15:41:16.615] debug2: Performing purge of old job records

Many thanks,
- Arie
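P.S. For completeness, the suspend side is the mirror image, roughly along these lines (again a simplified sketch; instance id and region are placeholders, not the real code):

#!/bin/bash
# Sketch of a suspend script: slurmctld passes the node list as $1 (here a single node).
NODE="$1"
INSTANCE_ID="i-006506267531a0511"   # placeholder: normally looked up from $NODE
REGION="eu-central-1"               # placeholder

# Stop the EC2 instance backing the node
aws ec2 stop-instances --region "$REGION" --instance-ids "$INSTANCE_ID"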