Hi Marcus, thanks for your reply. I'm new to Slurm deployment and I do not remember where I found this command to check the Slurm setup. SallocDefaultCommand is not defined in my slurm.conf file.
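For what it is worth, I believe the effective value can be checked with scontrol on the login node; the last line below is only a hypothetical example of how the option would look in slurm.conf if someone had set it, not something taken from my configuration:

    # ask the running slurmctld for the effective value
    scontrol show config | grep -i SallocDefaultCommand

    # hypothetical slurm.conf entry (example value only)
    SallocDefaultCommand="/bin/bash"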
What is strange to me is that it works on the node hosting slurmctld, and on the compute node too.

On the compute node, connected as root and then using "su - begou":

[root@tenibre-0-0 ~]# su - begou
Last login: Tue Nov 10 20:49:45 CET 2020 on pts/0
[begou@tenibre-0-0 ~]$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
equipment_typeC     up   infinite      1   idle tenibre-0-0
all*                up   infinite      1   idle tenibre-0-0
[begou@tenibre-0-0 ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[begou@tenibre-0-0 ~]$ salloc -n 1 srun hostname
salloc: Granted job allocation 45
tenibre-0-0
salloc: Relinquishing job allocation 45
[begou@tenibre-0-0 ~]$

On the management node, connected as root and then using "su - begou" (with no home directory available):

[root@management1 ~]# su - begou
Creating home directory for begou.
Last login: Thu Nov 12 12:43:47 CET 2020 on pts/1
su: warning: cannot change directory to /HA/sources/begou: No such file or directory
[begou@management1 root]$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
equipment_typeC     up   infinite      1   idle tenibre-0-0
all*                up   infinite      1   idle tenibre-0-0
[begou@management1 root]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[begou@management1 root]$ salloc -n 1 srun hostname
salloc: Granted job allocation 46
slurmstepd: error: couldn't chdir to `/root': Permission denied: going to /tmp instead
tenibre-0-0
salloc: Relinquishing job allocation 46
[begou@management1 root]$

But not on the login node, where I need it...

On 12/11/2020 at 14:05, Marcus Wagner wrote:
>
> For me at least, this is running as expected.
>
> I'm not sure why you use "sh" as the command for salloc; I have never
> seen that before. If you do not provide a command, the user's default
> shell will be started if "SallocDefaultCommand" is not set within
> slurm.conf.
>
> So, what does
>
> $> salloc -n 1
> $> srun hostname

This command hangs.

> and what does
>
> $> salloc -n 1 srun hostname

This command hangs too, from the login node.

> Best
> Marcus
>
> P.S.:
>
> Increasing debugging might also help, e.g.
>
> $> srun -vvvvv hostname

Yes, I tried this but wasn't able to find pertinent information.
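In case it is useful, I understand salloc itself also accepts repeated -v flags, so both sides of the launch can be made verbose at once; a hypothetical combined invocation, not one I have run here:

    salloc -vvv -n 1 srun -vvvvv hostname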
This is what I get:

[begou@tenibre ~]$ salloc -n 1 "srun -vvvvv hostname"
salloc: Granted job allocation 43
salloc: error: _fork_command: Unable to find command "srun -vvvvv hostname"
salloc: Relinquishing job allocation 43
[begou@tenibre ~]$ salloc -n 1 srun -vvvvv hostname
salloc: Granted job allocation 44
srun: defined options
srun: -------------------- --------------------
srun: (null)              : tenibre-0-0
srun: jobid               : 44
srun: job-name            : srun
srun: nodes               : 1
srun: ntasks              : 1
srun: verbose             : 5
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=18446744073709551615
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=512946
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_MEMLOCK=65536
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=44969
srun: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
srun: debug:  Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 44: nodes(1):`tenibre-0-0', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 44, user 23455, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/none
srun: debug:  Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 10
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug:  initialized stdio listening socket, port 34531
srun: debug:  Started IO server thread (139644034881280)
srun: debug:  Entering _launch_tasks
srun: debug3: IO thread pid = 1733164
srun: debug4: eio: handling events for 4 objects
srun: launching 44.0 on host tenibre-0-0, 1 tasks: 0
srun: debug3: uid:23455 gid:1036 cwd:/HA/sources/begou 0
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: route default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to tenibre-0-0
srun: debug4: orig_timeout was 20000 we have 0 steps and a timeout of 20000
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 44.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 1 fd 10
srun: debug2: false, shutdown
srun: debug3: eio_message_socket_readable: shutdown 1 fd 6
srun: debug2: false, shutdown
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, shutdown
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug2: false, shutdown
srun: debug:  IO thread exiting
salloc: Relinquishing job allocation 44
[begou@tenibre ~]$

This problem looks really strange to me...

Patrick

> On 10.11.2020 at 11:54, Patrick Bégou wrote:
>> Hi,
>>
>> I'm new to Slurm (as admin) and I need some help. Testing my initial
>> setup with:
>>
>> [begou@tenibre ~]$ salloc -n 1 sh
>> salloc: Granted job allocation 11
>> sh-4.4$ squeue
>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>      11       all       sh    begou  R       0:16      1 tenibre-0-0
>> sh-4.4$ srun /usr/bin/hostname
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 11.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> I checked the connections:
>>
>> tenibre is the login node (no daemon running):
>>
>> nc -v tenibre-0-0 6818
>> nc -v management1 6817
>>
>> management1 is the management node (slurmctld running):
>>
>> nc -v tenibre-0-0 6818
>>
>> tenibre-0-0 is the first compute node (slurmd running):
>>
>> nc -v management1 6817
>>
>> All tests return "Ncat: Connected...".
>>
>> The command "id begou" works on all nodes and I can reach my home
>> directory on the login node and on the compute node.
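My reading of the trace above, which may well be wrong: srun on the login node opens a listening socket for the step I/O (port 34531 in this run) and slurmd on tenibre-0-0 has to connect back to it, so the launch timeout would fit a blocked return path. A crude way to test that direction with nc, keeping in mind that the real port changes on every run and 34531 is just the value from this trace:

    # on the login node: listen on a throw-away high port
    nc -l 34531

    # on the compute node: try to reach it
    nc -v tenibre 34531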
>>
>> On the compute node slurmd.log shows:
>>
>> [2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>> [2020-11-10T11:21:38.050] debug: Checking credential with 508 bytes of sig data
>> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 seconds
>> [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE plugin loaded
>> [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather NOT_INVOKED plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Message thread started pid = 12099
>> [2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
>> [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin loaded: checkpoint/none
>> [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: job_container none plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
>> [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info: SLURM_STEP_RESV_PORTS not found in env
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
>> [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent thread
>> [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
>> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
>> [2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop signaling condition
>> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 4021
>> [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
>> [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5 failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
>> [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
>> [2020-11-10T11:21:40.372] [11.0] done with job
>>
>> But I do not understand what this "No route to host" means.
>>
>> Thanks for your help.
>>
>> Patrick
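P.S. Since the slurmd log above ends with "No route to host" when the compute node tries to reach back to srun, I suspect a host firewall on the login node. As far as I understand, srun listens on ephemeral ports for the step I/O; those ports can be pinned with SrunPortRange in slurm.conf and then opened explicitly. A rough sketch of what I have in mind, assuming firewalld on the login node and an arbitrary example port range:

    # slurm.conf (example range only)
    SrunPortRange=60001-63000

    # on the login node (tenibre), assuming firewalld is the host firewall
    firewall-cmd --permanent --add-port=60001-63000/tcp
    firewall-cmd --reload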