Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Andrej Prsa
Hi Brian,

> try:
> export SLURM_OVERLAP=1
> export SLURM_WHOLE=1
> before your salloc and see if that helps. I have seen some mpi issues that were resolved with that.

Unfortunately no dice:

andrej@terra:~$ export SLURM_OVERLAP=1
andrej@terra:~$ export SLURM_WHOLE=1
andrej@terra:~$ salloc -N2 -n2 s

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
try:
    export SLURM_OVERLAP=1
    export SLURM_WHOLE=1
before your salloc and see if that helps. I have seen some mpi issues that were resolved with that.

You can also try it using just the regular mpirun on the nodes allocated. That will help with a datapoint as well.

Brian Andrus

On 2/4/2021
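The suggestion above can be sketched as a short shell session. The node and task counts mirror the salloc call quoted elsewhere in this thread; the `srun hostname` payload and the guard around salloc are assumptions added for illustration, so that the snippet does nothing harmful on a machine without Slurm installed.

```shell
#!/bin/sh
# Sketch only: export the two variables suggested above before requesting an allocation.
export SLURM_OVERLAP=1   # read by srun; documented as equivalent to --overlap
export SLURM_WHOLE=1     # read by srun; documented as equivalent to --whole

# Guard (an assumption for portability): only call salloc if it is installed.
if command -v salloc >/dev/null 2>&1; then
    # -N2 -n2 mirrors the allocation in this thread; 'srun hostname' is a placeholder payload.
    salloc -N2 -n2 srun hostname
else
    echo "salloc not found; variables exported only"
fi
```

If the job still fails with both variables set, that at least rules out the step-overlap behavior as the cause.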

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Andrej Prsa
Hi Brian,

Thanks for your response!

> Did you compile slurm with mpi support?

Yep:

andrej@terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4

> Your mpi libraries should be the same as that version and they should be available in th
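The check above can be scripted, as a rough sketch. The helper name `has_pmix` is hypothetical; it only scans text in the format shown in the listing above and makes no claim about how Slurm itself selects a plugin.

```shell
#!/bin/sh
# Sketch only: has_pmix is a hypothetical helper that reads an
# 'srun --mpi=list' style listing on stdin and succeeds if a pmix entry is present.
has_pmix() {
    grep -Eq '^srun: pmix(_v[0-9]+)?$'
}

# Only query srun if it is installed; stderr is merged in case the list is printed there.
if command -v srun >/dev/null 2>&1; then
    if srun --mpi=list 2>&1 | has_pmix; then
        echo "pmix plugin listed"
    else
        echo "pmix plugin not listed"
    fi
fi
```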

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
Did you compile slurm with mpi support?

Your mpi libraries should be the same as that version and they should be available in the same locations for all nodes. Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc are set)

Brian Andrus

On 2/4/2021 1:20 PM, Andrej Prsa wrote:
> Gentle bum
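One way to gather the information asked for above is a small shell helper that reports where the MPI launcher resolves and what the library path is on the current host. This is a minimal sketch: `check_mpi_env` is a hypothetical name, and `mpirun` is assumed to be the launcher of interest because it is the one mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: check_mpi_env is a hypothetical helper, not part of Slurm or any MPI package.
# It prints where a launcher resolves in PATH and the current LD_LIBRARY_PATH.
check_mpi_env() {
    launcher=${1:-mpirun}
    host=$(uname -n)
    if path=$(command -v "$launcher" 2>/dev/null); then
        echo "$host: $launcher -> $path"
    else
        echo "$host: $launcher not found in PATH"
    fi
    echo "$host: LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
}

check_mpi_env mpirun
```

Running the same function on each allocated node and comparing the output lines should show whether the launcher and library paths agree across nodes.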

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Andrej Prsa
Gentle bump on this, if anyone has suggestions as I weed through the scattered slurm docs. :)

Thanks,
Andrej

On February 2, 2021 00:14:37 Andrej Prsa wrote:
> Dear list,
> I'm struggling with what seems to be very similar to this thread: https://lists.schedmd.com/pipermail/slurm-users/2019-Jul

[slurm-users] Submitting jobs across multiple nodes fails

2021-02-01 Thread Andrej Prsa
Dear list,

I'm struggling with what seems to be very similar to this thread:
https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html

I'm using slurm 20.11.3 patched with this fix to detect pmixv4:
https://bugs.schedmd.com/show_bug.cgi?id=10683

and this is what I'm seeing: a