Hi Brian,
try:
export SLURM_OVERLAP=1
export SLURM_WHOLE=1
before your salloc and see if that helps. I have seen some MPI issues
that were resolved by that.
Unfortunately no dice:
andrej@terra:~$ export SLURM_OVERLAP=1
andrej@terra:~$ export SLURM_WHOLE=1
andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 864
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=864.0 aborted before
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error
You can also try running plain mpirun directly on the allocated nodes.
That would give us another data point as well.
Same as above, unfortunately.
_But:_ I can get it to work correctly if I replace MpiDefault=pmix with
MpiDefault=none in slurm.conf. It looks like something is amiss with the
PMIx support in Slurm?
andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 866
andrej@terra:~$ srun hostname
node11
node10
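For completeness, the workaround above corresponds to this slurm.conf change (a sketch; the file location and how you propagate the change vary by site):

```
# slurm.conf (must match on all nodes)
# MpiDefault=pmix       # original setting: job steps abort before launching
MpiDefault=none         # workaround: steps launch fine

# after editing, propagate the change, e.g.:
#   scontrol reconfigure
```

Note that you can also test this per step with `srun --mpi=none hostname`, which overrides MpiDefault for that step without editing slurm.conf.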
Cheers,
Andrej