Hi Patrick,

for me at least, this is running as expected.

I'm not sure why you use "sh" as the command for salloc; I have never seen
that before. If you do not provide a command, the user's default shell will
be started, provided "SallocDefaultCommand" is not set in slurm.conf.


So, what does

$> salloc -n 1
$> srun hostname

return, and what does

$> salloc -n 1 srun hostname

return?


Best
Marcus


P.S.:

Increasing the debug level might also help, e.g.

$> srun -vvvvv hostname

On 10.11.2020 at 11:54, Patrick Bégou wrote:
Hi,

I'm new to Slurm (as an admin) and I need some help. I am testing my initial setup with:

    [begou@tenibre ~]$ salloc -n 1 sh
    salloc: Granted job allocation 11
    sh-4.4$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    11       all       sh    begou  R       0:16      1 tenibre-0-0
    sh-4.4$ srun /usr/bin/hostname
    srun: error: timeout waiting for task launch, started 0 of 1 tasks
    srun: Job step 11.0 aborted before step completely launched.
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: Timed out waiting for job step to complete

I checked the connections:

*tenibre is the login node* (no daemon running)

    nc -v tenibre-0-0 6818
    nc -v management1 6817

*management1 is the management node* (slurmctld running)

    nc -v tenibre-0-0 6818

*tenibre-0-0 is the first compute node* (slurmd running)

    nc -v management1 6817

All tests return "Ncat: Connected...".

The command "id begou" works on all nodes and I can reach my home directory on 
the login node and on the compute node.

On the compute node slurmd.log shows:

    [2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
    [2020-11-10T11:21:38.050] debug:  Checking credential with 508 bytes of sig data
    [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
    [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
    [2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
    [2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
    [2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE plugin loaded
    [2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin loaded
    [2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
    [2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather NOT_INVOKED plugin loaded
    [2020-11-10T11:21:38.054] [11.0] debug:  Message thread started pid = 12099
    [2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources: 11 0
    [2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded: checkpoint/none
    [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin loaded
    [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
    [2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
    [2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
    [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
    [2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
    [2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
    [2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
    [2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent thread
    [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
    [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
    [2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop signaling condition
    [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
    [2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
    [2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
    [2020-11-10T11:21:40.372] [11.0] debug:  Message thread exited
    [2020-11-10T11:21:40.372] [11.0] debug:  mpi/pmi2: agent thread exit
    [2020-11-10T11:21:40.372] [11.0] done with job


But I do not understand what this "No route to host" means.


Thanks for your help.

Patrick



--
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
