Dear list,

I'm struggling with an issue that seems very similar to the one in this thread:

https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html

I'm using Slurm 20.11.3, patched with this fix so that PMIx v4 is detected:

    https://bugs.schedmd.com/show_bug.cgi?id=10683

and this is what I'm seeing:

andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
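
For what it's worth, mpi/pmix is clearly the plugin being selected (all the errors below come from it). If it matters, this is how I check which MPI plugins the build exposes:

    srun --mpi=list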

In slurmctld.log I have this:

[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841 NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for 0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for 0x55f568e00cb0s on node10

and in slurmd.log I have this for node9:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108 [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job

and this for node10:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job

The culprit seems to be the bind() failure on node9, but I can't make much sense of it. I have checked that /etc/hosts is correct and consistent with the node information in slurm.conf.
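
In case it helps, this is roughly how I would look for a leftover socket under the slurmd spool directory on both nodes (the path is taken from the bind() error above, and I'm assuming plain ssh access to the compute nodes):

    # look for stale pmix step sockets in the slurmd spool dir on each node
    for n in node9 node10; do
        ssh "$n" 'ls -l /var/spool/slurmd/ | grep -i pmix'
    done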

Other potentially relevant info: all compute nodes are diskless; they are PXE-booted from a NAS image and run Ubuntu Server 20.04. Running jobs on a single node works fine.
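
Since the nodes all boot from the same image, I also want to rule out /var/spool/slurmd ending up on shared storage rather than something node-local; this is how I would check what each node actually mounts there (again assuming ssh access):

    # confirm the slurmd spool dir is node-local, not NFS/shared
    for n in node9 node10; do
        ssh "$n" 'df -hT /var/spool/slurmd'
    done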

Thanks for any insight and suggestions.

Cheers,
Andrej
