Hi Patrick, for me at least, this is running as expected.
I'm not sure why you use "sh" as the command for salloc; I have never seen that before. If you do not provide a command, the user's default shell is started, unless "SallocDefaultCommand" is set in slurm.conf.

So, what does this do:

    $> salloc -n 1
    $> srun hostname

and what does this do:

    $> salloc -n 1 srun hostname

Best
Marcus

P.S.: Increasing the debug level might also help, e.g.

    $> srun -vvvvv hostname

On 10.11.2020 at 11:54, Patrick Bégou wrote:
Hi,

I'm new to slurm (as admin) and I need some help. Testing my initial setup with:

    [begou@tenibre ~]$ salloc -n 1 sh
    salloc: Granted job allocation 11
    sh-4.4$ squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    11 all sh begou R 0:16 1 tenibre-0-0
    sh-4.4$ srun /usr/bin/hostname
    srun: error: timeout waiting for task launch, started 0 of 1 tasks
    srun: Job step 11.0 aborted before step completely launched.
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: Timed out waiting for job step to complete

I check the connections:

tenibre is the login node (no daemon running)

    nc -v tenibre-0-0 6818
    nc -v management1 6817

management1 is the management node (slurmctld running)

    nc -v tenibre-0-0 6818

tenibre-0-0 is the first compute node (slurmd running)

    nc -v management1 6817

All tests return "Ncat: Connected..."

The command "id begou" works on all nodes and I can reach my home directory on the login node and on the compute node.

On the compute node slurmd.log shows:

    [2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
    [2020-11-10T11:21:38.050] debug: Checking credential with 508 bytes of sig data
    [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
    [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
    [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
    [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin loaded
    [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE plugin loaded
    [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE plugin loaded
    [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
    [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather NOT_INVOKED plugin loaded
    [2020-11-10T11:21:38.054] [11.0] debug: Message thread started pid = 12099
    [2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
    [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin loaded: checkpoint/none
    [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug: job_container none plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
    [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
    [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
    [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
    [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
    [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
    [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
    [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent thread
    [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
    [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
    [2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop signaling condition
    [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
    [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
    [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
    [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
    [2020-11-10T11:21:40.372] [11.0] done with job

But I do not understand what this "No route to host" means.

Thanks for your help.

Patrick
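A note on the log above: the nc tests only cover ports 6817 and 6818, while the "launch task" line shows the request arriving from HOST:172.30.1.254 PORT:42220, and the later "connect io" and "_send_srun_resp_msg" errors are attempts by the compute node to connect back to srun. "No route to host" is the text of the EHOSTUNREACH error; a firewall that rejects the connection can produce it even when routing is fine, but whether that applies here is an assumption, not something the log confirms. The sketch below is one way to test that reverse path. The function names parse_srun_endpoint and check_tcp are hypothetical helpers, and check_tcp assumes bash (for /dev/tcp) and the coreutils timeout command are available on the compute node.

```shell
#!/bin/sh
# Sketch only: helpers to test the compute node -> srun host direction.
# Function names are hypothetical, not part of Slurm.

# Print "HOST PORT" taken from a slurmd "launch task ... request from" line.
# Prints nothing if the line has no HOST:/PORT: fields.
parse_srun_endpoint() {
    printf '%s\n' "$1" |
        sed -n 's/.*HOST:\([^ ]*\) PORT:\([0-9]*\).*/\1 \2/p'
}

# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
check_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

A possible way to use it: srun has already exited, so nothing listens on its old port any more. Start a listener on the login node on a port of your choice (for example "nc -l 42220"), then on the compute node run "check_tcp 172.30.1.254 42220 && echo reachable || echo blocked". If this fails while the 6817 and 6818 tests succeed, the path from the compute node back to the login node is what needs attention.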
--
Dipl.-Inf. Marcus Wagner

IT Center
Group: Linux Systems Group
Department: Systems and Operations

RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ