This looks like it may be trying to do something using MPI.
What does your slurm.conf look like for that node?
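In case it helps narrow things down, here is a small sketch for pulling out only the settings that seem relevant to how srun and slurmd talk to each other. The function name filter_launch_settings is made up for illustration, and the choice of parameters (MpiDefault, SrunPortRange, SlurmdPort, SlurmctldPort) is my guess at what matters here, not a definitive list.

```shell
# filter_launch_settings (hypothetical helper): read "Name = value" lines on
# stdin and keep only the parameters assumed to matter for task launch.
filter_launch_settings() {
    grep -i -E '^[[:space:]]*(MpiDefault|SrunPortRange|SlurmdPort|SlurmctldPort)[[:space:]]*='
}

# Usage on a node with the Slurm client tools installed:
#   scontrol show config | filter_launch_settings
```

Running it on both the login node and the compute node and comparing the output would show whether the two see the same configuration.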
Brian Andrus
On 11/10/2020 2:54 AM, Patrick Bégou wrote:
Hi,
I'm new to Slurm (as an admin) and need some help. I am testing my initial
setup with:
[begou@tenibre ~]$ salloc -n 1 sh
salloc: Granted job allocation 11
sh-4.4$ squeue
JOBID PARTITION NAME USER  ST TIME NODES NODELIST(REASON)
   11 all       sh   begou R  0:16     1 tenibre-0-0
sh-4.4$ srun /usr/bin/hostname
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 11.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
I checked the connections:
*tenibre is the login node* (no daemon running)
nc -v tenibre-0-0 6818
nc -v management1 6817
*management1 is the management node* (slurmctld running)
nc -v tenibre-0-0 6818
*tenibre-0-0 is the first compute node* (slurmd running)
nc -v management1 6817
All tests return "Ncat: Connected...".
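The same checks can be wrapped in a small helper so they are easy to repeat from each node. This is only a sketch: check_port is a made-up name, and it assumes the installed nc accepts -z (connect without sending data) and -w (timeout in seconds), which Ncat and OpenBSD netcat do.

```shell
# check_port HOST PORT (hypothetical helper): report whether a TCP
# connection to HOST:PORT can be opened within 3 seconds.
check_port() {
    if nc -z -w 3 "$1" "$2" 2>/dev/null; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}

# Usage, with the host names and ports from the tests above:
#   check_port tenibre-0-0 6818
#   check_port management1 6817
```

Note that these tests only cover the daemon ports (6817 and 6818). They do not test a connection opened from the compute node back to the login node, which may also be worth checking.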
The command "id begou" works on all nodes and I can reach my home
directory on the login node and on the compute node.
On the compute node slurmd.log shows:
[2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
[2020-11-10T11:21:38.050] debug: Checking credential with 508 bytes of sig data
[2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
[2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
[2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin loaded
[2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE plugin loaded
[2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE plugin loaded
[2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather NOT_INVOKED plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug: Message thread started pid = 12099
[2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
[2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin loaded: checkpoint/none
[2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug: job_container none plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
[2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
[2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
[2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
[2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
[2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent thread
[2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
[2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop signaling condition
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
[2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
[2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
[2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
[2020-11-10T11:21:40.372] [11.0] done with job
But I do not understand what this "No route to host" means.
Thanks for your help.
Patrick