Disabling the firewall service on the CentOS client allows the ‘srun hostname’ command to run.
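For anyone who wants to check this without turning the firewall off entirely, here is a minimal sketch (Python, standard library only) of a TCP reachability probe that could be run on a compute node against the login node. The function names are my own, and any host name and port passed in are placeholders for your site's values. As far as I know, srun listens on kernel-chosen ephemeral ports unless SrunPortRange is set in slurm.conf, so a fixed range would have to be configured there before specific ports can be opened in the firewall.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def blocked_ports(host, ports, timeout=3.0):
    """Return the subset of ports on host that could not be reached."""
    return [p for p in ports if not port_is_open(host, p, timeout)]


if __name__ == "__main__":
    # Loopback demonstration only: listen on a kernel-chosen port, then probe it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        print("listening port reachable:", port_is_open("127.0.0.1", port))
```

A False result only means the connection did not succeed: either nothing is listening on that port or something on the path is blocking it. So probe a port on which a process is known to be listening, for example port_is_open("login-node", 60001), where both the host name and the port are hypothetical.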

From: Buckley, Ronan
Sent: Tuesday, July 17, 2018 12:00 PM
To: 'Slurm User Community List'
Subject: RE: [slurm-users] 'srun hostname' hangs on the command line

Hi Carlos,

Is there a way to test that? Are there certain ports that need to be open?

Thanks.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Carlos Fenoy
Sent: Tuesday, July 17, 2018 11:55 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

The communication from the compute nodes to the login nodes may be blocked by the firewall. That will prevent srun from running properly.

Sent from my iPhone

On 17 Jul 2018, at 10:16, John Hearns <hear...@googlemail.com> wrote:

Ronan, as far as I can see this means that you cannot launch a job.
What state are the compute nodes in when you run sinfo?

On 17 July 2018 at 10:08, Buckley, Ronan <ronan.buck...@dell.com> wrote:

Yes, srun just hangs. Commands like sinfo and squeue run fine.
I also have no slurm logs in /var/log ??

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of John Hearns
Sent: Tuesday, July 17, 2018 8:57 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

Ronan, sorry to ask but this is a bit unclear.
Are you unable to launch ANY sessions with srun? In which case you need to look at the logs to see why the job is not being scheduled.
Is it only the hostname command which fails? I would guess very much you have already run an ssh into a node and run the hostname command manually.

On 17 July 2018 at 09:50, Buckley, Ronan <ronan.buck...@dell.com> wrote:

Yes I do.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Williams, Gareth (IM&T, Clayton)
Sent: Tuesday, July 17, 2018 12:33 AM
To: Slurm User Community List
Subject: Re: [slurm-users] 'srun hostname' hangs on the command line

Do you get the same problem as a non-root user?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, 17 July 2018 12:53 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] 'srun hostname' hangs on the command line

Hi All,

Verbose mode doesn’t show much. I hashed out the hostnames. Any ideas/suggestions?

# srun hostname
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 0: unknown
^Z
[1]+  Stopped                 srun hostname
#
# srun -v hostname
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user : `root'
srun: uid : 0
srun: gid : 0
srun: cwd : /root
srun: ntasks : 1 (default)
srun: nodes : 1 (default)
srun: jobid : 4294967294 (default)
srun: partition : default
srun: profile : `NotSet'
srun: job name : `(null)'
srun: reservation : `(null)'
srun: burst_buffer : `(null)'
srun: wckey : `(null)'
srun: cpu_freq_min : 4294967294
srun: cpu_freq_max : 4294967294
srun: cpu_freq_gov : 4294967294
srun: switches : -1
srun: wait-for-switches : -1
srun: distribution : unknown
srun: cpu_bind : default (0)
srun: mem_bind : default (0)
srun: verbose : 1
srun: slurmd_debug : 0
srun: immediate : false
srun: label output : false
srun: unbuffered IO : false
srun: overcommit : false
srun: threads : 60
srun: checkpoint_dir : /var/slurm/checkpoint
srun: wait : 0
srun: nice : -2
srun: account : (null)
srun: comment : (null)
srun: dependency : (null)
srun: exclusive : false
srun: bcast : false
srun: qos : (null)
srun: constraints :
srun: geometry : (null)
srun: reboot : yes
srun: rotate : no
srun: preserve_env : false
srun: network : (null)
srun: propagate : NONE
srun: prolog : (null)
srun: epilog : (null)
srun: mail_type : NONE
srun: mail_user : (null)
srun: task_prolog : (null)
srun: task_epilog : (null)
srun: multi_prog : no
srun: sockets-per-node : -2
srun: cores-per-socket : -2
srun: threads-per-core : -2
srun: ntasks-per-node : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core : -2
srun: plane_size : 4294967294
srun: core-spec : NA
srun: power :
srun: remote command : `hostname'
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes ####### are ready for job
srun: jobid 50871: nodes(1):`#######', cpu counts: 64(x1)
srun: launching 50871.0 on host #######, 1 tasks: 0
srun: route default plugin loaded
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 50871.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
#

Rgds