Pär, by 'poking around' Chris means using tools such as netstat and lsof. I would also look at ps -eaf --forest to make sure there are no 'orphaned' jobs sitting on that compute node.
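For example, something along these lines (a rough sketch only; the 'pmix' pattern and the port are assumptions, so adjust them to whatever the slurmd/PMIx error on your node actually names):

    # any process still holding a PMIx UNIX-domain socket open?
    lsof -U | grep -i pmix
    # anything bound to slurmd's TCP port (6818 unless you changed SlurmdPort)?
    netstat -tlnp | grep 6818
    # leftover job steps or MPI processes from an earlier run?
    ps -eaf --forest | grep -E 'slurmstepd|srun' | grep -v grep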
Having said that though, I have a dim memory of a classic PBSPro error message which says something about a network connection, but really means that you cannot open a remote session on that compute server.

As an aside, have you checked that your username exists on that compute server (getent passwd par)? Also that your home directory is mounted - or something substituting for your home directory? (A couple of quick checks for both are sketched after the quoted message below.)

On Fri, 12 Jul 2019 at 15:55, Chris Samuel <ch...@csamuel.org> wrote:

> On 12/7/19 7:39 am, Pär Lundö wrote:
>
> > Presumably, the first 8 tasks originate from the first node (in this
> > case the lxclient11), and the other node (lxclient10) responds as
> > predicted.
>
> That looks right, it seems the other node has two processes fighting
> over the same socket and that's breaking Slurm there.
>
> > Is it necessary to have passwordless ssh communication alongside the
> > munge authentication?
>
> No, srun doesn't need (or use) that at all.
>
> > In addition I checked the slurmctld log from both the server and client
> > and found something (noted in bold):
>
> This is from the slurmd log on the client from the look of it.
>
> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> > for tasks lurm.pmix.83.0: Address already in use[98]*
> > [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
> > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> > [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
> > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>
> That indicates that something else has grabbed the socket it wants and
> that's why the setup of the MPI ranks on the second node fails.
>
> You'll want to poke around there to see what's using it.
>
> Best of luck!
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
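P.S. A quick way to run those two account checks directly on the compute node (a sketch only; 'par' is taken from the getent example above and '/home/par' is just a guess at the real home path):

    # does the account resolve on this node (local files, LDAP, SSSD, ...)?
    getent passwd par
    # is the home directory (or whatever substitutes for it) actually mounted?
    ls -ld /home/par
    findmnt --target /home/par
    # and writable by the user?
    su - par -c 'touch ~/.slurm_write_test && rm ~/.slurm_write_test'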