Dear list,

I'm struggling with an issue that seems very similar to the one in this thread:

https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html

I'm using Slurm 20.11.3, patched with this fix so that PMIx v4 is detected:

    https://bugs.schedmd.com/show_bug.cgi?id=10683

and this is what I'm seeing:

andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
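
For what it's worth, mpi/pmix is clearly the plugin being selected (all the errors below come from it). If it matters, this is how I check which MPI plugins the build exposes:

    srun --mpi=list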

In slurmctld.log I have this:

[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841 NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for 0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for 0x55f568e00cb0s on node10

and in slurmd.log I have this for node9:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108 [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job

and this for node10:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job

The culprit seems to be the bind() failure on node9, but I can't make much sense of it. I have checked that /etc/hosts is correct and consistent with the node information in slurm.conf.
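
In case it helps, this is roughly how I would look for a leftover socket under the slurmd spool directory on both nodes (the path is taken from the bind() error above, and I'm assuming plain ssh access to the compute nodes):

    # look for stale pmix step sockets in the slurmd spool dir on each node
    for n in node9 node10; do
        ssh "$n" 'ls -l /var/spool/slurmd/ | grep -i pmix'
    done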

Other potentially relevant info: all compute nodes are diskless; they are PXE-booted from a NAS image and run Ubuntu Server 20.04. Running jobs on a single node works fine.
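
Since the nodes all boot from the same image, I also want to rule out /var/spool/slurmd ending up on shared storage rather than something node-local; this is how I would check what each node actually mounts there (again assuming ssh access):

    # confirm the slurmd spool dir is node-local, not NFS/shared
    for n in node9 node10; do
        ssh "$n" 'df -hT /var/spool/slurmd'
    done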

Thanks for any insight and suggestions.

Cheers,
Andrej
