Hi Slurm admins and developers, does no one have an idea about this problem?
Still investigating this morning, I discovered that it works from the management node (a small VM running slurmctld), even though I have no home directory on it (I use su from root to switch to an unprivileged user). It still doesn't run from the login node, even with all firewalls disabled :-(

Patrick

On 10/11/2020 at 11:54, Patrick Bégou wrote:
>
> Hi,
>
> I'm new to slurm (as admin) and I need some help. Testing my initial
> setup with:
>
>     [begou@tenibre ~]$ salloc -n 1 sh
>     salloc: Granted job allocation 11
>     sh-4.4$ squeue
>     JOBID PARTITION NAME USER  ST TIME NODES NODELIST(REASON)
>        11 all       sh   begou R  0:16     1 tenibre-0-0
>     sh-4.4$ srun /usr/bin/hostname
>     srun: error: timeout waiting for task launch, started 0 of 1 tasks
>     srun: Job step 11.0 aborted before step completely launched.
>     srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>     srun: error: Timed out waiting for job step to complete
>
> I checked the connections:
>
> *tenibre is the login node* (no daemon running)
>
>     nc -v tenibre-0-0 6818
>     nc -v management1 6817
>
> *management1 is the management node* (slurmctld running)
>
>     nc -v tenibre-0-0 6818
>
> *tenibre-0-0 is the first compute node* (slurmd running)
>
>     nc -v management1 6817
>
> All tests return "Ncat: Connected..."
>
> The command "id begou" works on all nodes and I can reach my home
> directory on the login node and on the compute node.
>
> On the compute node, slurmd.log shows:
>
>     [2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>     [2020-11-10T11:21:38.050] debug: Checking credential with 508 bytes of sig data
>     [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>     [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
>     [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
>     [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather NOT_INVOKED plugin loaded
>     [2020-11-10T11:21:38.054] [11.0] debug: Message thread started pid = 12099
>     [2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
>     [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
>     [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin loaded: checkpoint/none
>     [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
>     [2020-11-10T11:21:38.068] [11.0] debug: job_container none plugin loaded
>     [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
>     [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>     [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
>     [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
>     [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
>     [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
>     [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
>     [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent thread
>     [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
>     [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
>     [2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop signaling condition
>     [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
>     [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
>     [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
>     [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
>     [2020-11-10T11:21:40.372] [11.0] done with job
>
> But I do not understand what this "No route to host" means.
>
> Thanks for your help.
>
> Patrick
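
A note on the "No route to host" question, offered as a possible line of investigation rather than a confirmed diagnosis. "No route to host" is the system error text for EHOSTUNREACH, returned when a connection attempt cannot reach the target address at all. It usually points to a routing problem or to a firewall that rejects the packet with an ICMP "host prohibited" reply. The nc tests in the message cover ports 6817 and 6818, but the failing step ("connect io") is a connection from the compute node back to the host that ran srun, so that direction may be worth testing separately. The Python sketch below assumes it is run on the compute node. The helper names (parse_launch_request, diagnose, probe) are hypothetical and not part of Slurm. The PORT field in the log may not be the port srun listens on, so the host address is the useful part; any port will do for the probe, because "refused" already proves the host is reachable.

```python
import errno
import re
import socket

# Matches the HOST/PORT fields of a slurmd "launch task ... request from" line.
LAUNCH_RE = re.compile(r"HOST:(\S+)\s+PORT:(\d+)")


def parse_launch_request(line):
    """Return (host, port) from a slurmd launch-request log line, or None."""
    match = LAUNCH_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def diagnose(exc):
    """Map a failed connect() to a short hint (an interpretation, not a proof)."""
    if isinstance(exc, socket.timeout):
        return "timeout: packets are probably dropped silently"
    if isinstance(exc, ConnectionRefusedError):
        return "refused: host reachable, nothing listening on this port"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host: routing problem or firewall reject"
    return "other error: %s" % exc


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return diagnose(exc)


# Example, to be run on the compute node. 50000 is an arbitrary,
# hypothetical port; the address is the HOST field from slurmd.log.
# print(probe("172.30.1.254", 50000))
```

If the probe reports "no route to host" for the address shown in the HOST field of slurmd.log (172.30.1.254 here), it would be worth checking which interface of the login node owns that address and whether the compute node has a route to it. If a firewall has to stay enabled on the login node, the SrunPortRange parameter in slurm.conf can restrict the ports srun listens on so that they can be opened explicitly.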