Did you compile slurm with mpi support?

Your mpi libraries should match the version slurm was built against, and they should be available in the same locations on all nodes.
Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc. are set).
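A couple of quick checks along those lines (the plugin path below is just a guess for a Debian/Ubuntu-style layout; adjust it to wherever your mpi_pmix plugin actually lives):

# confirm slurm actually built and registered the pmix plugin
srun --mpi=list

# confirm the plugin links against the libpmix you expect, on the head node and on a compute node
ldd /usr/lib/x86_64-linux-gnu/slurm-wlm/mpi_pmix.so | grep -i pmix
ssh node9 'ldd /usr/lib/x86_64-linux-gnu/slurm-wlm/mpi_pmix.so | grep -i pmix'

If those don't line up across nodes, that's usually the place to start.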

Brian Andrus

On 2/4/2021 1:20 PM, Andrej Prsa wrote:
Gentle bump on this, if anyone has suggestions as I weed through the scattered slurm docs. :)

Thanks,
Andrej

On February 2, 2021 00:14:37 Andrej Prsa <aprs...@gmail.com> wrote:

Dear list,

I'm struggling with what seems to be very similar to this thread:

https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html

I'm using slurm 20.11.3 patched with this fix to detect pmixv4:

https://bugs.schedmd.com/show_bug.cgi?id=10683

and this is what I'm seeing:

andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error

In slurmctld.log I have this:

[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841
NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for
0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for
0x55f568e00cb0s on node10

and in slurmd.log I have this for node9:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
_task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108
[pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket
/var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387
[pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169
[p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally,
rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job

and this for node10:

[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
_task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1]
pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init
failed with error -2
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518
[pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423
[pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169
[p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally,
rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job

It seems that the culprit is the bind() failure, but I can't make much
sense of it. I checked that /etc/hosts has everything correct and
consistent with the info in slurm.conf.
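Since the socket path in the error sits under the slurmd spool dir, this is what I plan to look at next on both nodes (assuming the default SlurmdSpoolDir of /var/spool/slurmd):

ssh node9 'ls -l /var/spool/slurmd/stepd.slurm.pmix.* 2>/dev/null; ss -xl | grep -i pmix'
ssh node10 'ls -l /var/spool/slurmd/stepd.slurm.pmix.* 2>/dev/null; ss -xl | grep -i pmix'

i.e. whether anything is left over from an earlier step that could make bind() fail with "Address already in use".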

Other potentially relevant info: all compute nodes are diskless; they are
PXE-booted from a NAS image and run Ubuntu Server 20.04. Running jobs on a
single node works fine.
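(One assumption I still want to rule out, given the shared image: that /var/spool/slurmd is actually node-local storage on each node rather than part of the shared NAS mount, since the PMIx socket above is created under it. Something like

scontrol show config | grep -i SlurmdSpoolDir
ssh node9 'df -hT /var/spool/slurmd'
ssh node10 'df -hT /var/spool/slurmd'

should show the same path backed by a local filesystem on each node.)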

Thanks for any insight and suggestions.

Cheers,
Andrej
