Hi Brian,

try:

export SLURM_OVERLAP=1
export SLURM_WHOLE=1

before your salloc and see if that helps. I have seen some mpi issues that were resolved with that.

Unfortunately no dice:

andrej@terra:~$ export SLURM_OVERLAP=1
andrej@terra:~$ export SLURM_WHOLE=1
andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 864
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=864.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error

You can also try just the regular mpirun on the allocated nodes. That will give us another data point as well.

Same as above, unfortunately.

_But:_ I can get it to work correctly if I replace MpiDefault=pmix with MpiDefault=none. It looks like there's something amiss with pmix support in slurm?

andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 866
andrej@terra:~$ srun hostname
node11
node10
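
For reference, the only change behind the working run above was this slurm.conf edit (the file path and surrounding settings are assumptions; only the MpiDefault line is what I actually changed):

```
# /etc/slurm/slurm.conf (path may vary by install)

# Original setting -- job steps abort before launching:
#MpiDefault=pmix

# Workaround -- steps launch fine:
MpiDefault=none
```

`srun --mpi=list` should show which MPI plugins this Slurm build actually has, which might help confirm whether pmix support is really present or just misconfigured.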

Cheers,
Andrej
