I think this message can also happen if the slurm.conf on your login node is
missing the entry for the slurmd node.  The 2020 releases have a way to
automate syncing of the configuration.
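For reference, a minimal sketch of what that entry could look like, assuming the node names used later in this thread; the CPUs value is purely illustrative and should match the real hardware:

```ini
# slurm.conf on the login node -- the compute node must be declared here,
# identical to the entry on the other nodes (CPUs value is illustrative):
NodeName=tenibre-0-0 CPUs=1 State=UNKNOWN

# "Configless" operation (Slurm 20.02 and later): slurmctld serves the
# configuration, so login and compute nodes no longer keep local copies.
SlurmctldParameters=enable_configless
```

With configless mode enabled, slurmd and the client commands fetch the configuration from the controller (slurmd can be pointed at it with --conf-server), so the login node's copy can no longer drift out of sync.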

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Patrick 
Bégou
Sent: Thursday, November 12, 2020 7:38 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] failed to send msg type 6002: No route to host



Hi slurm admins and developers,

Does no one have an idea about this problem?

Still investigating this morning, I discovered that it works from the
management node (a small VM running slurmctld), even though I have no home
directory on it (I use an su command from root to gain an unprivileged user
environment). It still doesn't run from the login node, even with all
firewalls disabled :-(

Patrick

On 10/11/2020 at 11:54, Patrick Bégou wrote:

Hi,

I'm new to Slurm (as an admin) and I need some help. I am testing my initial setup with:
[begou@tenibre ~]$ salloc -n 1 sh
salloc: Granted job allocation 11
sh-4.4$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11       all       sh    begou  R       0:16      1 tenibre-0-0
sh-4.4$ srun /usr/bin/hostname
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 11.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

I checked the connections:

From tenibre, the login node (no daemon running):
  nc -v tenibre-0-0 6818
  nc -v management1 6817

From management1, the management node (slurmctld running):
  nc -v tenibre-0-0 6818

From tenibre-0-0, the first compute node (slurmd running):
  nc -v management1 6817

All tests return "Ncat: Connected...".

The command "id begou" works on all nodes and I can reach my home directory on 
the login node and on the compute node.

On the compute node slurmd.log shows:
[2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
[2020-11-10T11:21:38.050] debug:  Checking credential with 508 bytes of sig data
[2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
[2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather NOT_INVOKED plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Message thread started pid = 12099
[2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources: 11 0
[2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded: checkpoint/none
[2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
[2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
[2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
[2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent thread
[2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
[2020-11-10T11:21:38.069] [11.0] debug:  step_terminate_monitor_stop signaling condition
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
[2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
[2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
[2020-11-10T11:21:40.372] [11.0] debug:  Message thread exited
[2020-11-10T11:21:40.372] [11.0] debug:  mpi/pmi2: agent thread exit
[2020-11-10T11:21:40.372] [11.0] done with job



But I do not understand what this "No route to host" means.



Thanks for your help.

Patrick



