Jürgen,

>> does it work with `srun --overlap ...´ or if you do `export SLURM_OVERLAP=1´
>> before running your interactive job?
I performed testing yesterday using the "--overlap" flag, but it didn't do
anything. However, exporting the variable instead seems to have corrected the
issue:

> $ srun --ntasks-per-node=1 -N 8 --qos=devel --partition=devel -t 00:15:00
> --pty /bin/bash
> srun: job 1034834 queued and waiting for resources
> srun: job 1034834 has been allocated resources
> [~] $ env|grep SLURM_OVER
> SLURM_OVERLAP=1
> [~] $ module list
> Currently Loaded Modulefiles:
> 1) mpi/openmpi/3.1.6
> [~] $ mpirun -n $SLURM_NTASKS uptime
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.00, 0.01, 0.05
> 11:32:55 up 215 days, 8:42, 0 users, load average: 0.03, 0.03, 0.05
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.00, 0.01, 0.05
> 11:32:55 up 365 days, 22:21, 0 users, load average: 0.06, 0.05, 0.05
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.06, 0.05, 0.05
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.04, 0.04, 0.05
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.06, 0.04, 0.05
> 11:32:55 up 365 days, 22:22, 0 users, load average: 0.04, 0.04, 0.05
> [~] $ mpirun -n $SLURM_NTASKS hostname
> mdc-1057-30-1
> mdc-1057-30-3
> mdc-1057-30-7
> mdc-1057-30-4
> mdc-1057-30-8
> mdc-1057-30-5
> mdc-1057-30-2
> mdc-1057-30-6

Thanks for that suggestion! I imagine this could be a bug, then: specifying
"--overlap" with `srun` has no effect, while manually setting the variable
does.

John DeSantis

On 4/28/21 11:27 AM, Juergen Salk wrote:
> Hi John,
>
> does it work with `srun --overlap ...´ or if you do `export SLURM_OVERLAP=1´
> before running your interactive job?
>
> Best regards
> Jürgen
>
>
> * John DeSantis <desan...@usf.edu> [210428 09:41]:
>> Hello all,
>>
>> Just an update, the following URL almost mirrors the issue we're seeing:
>> https://github.com/open-mpi/ompi/issues/8378
>>
>> But SLURM 20.11.3 was shipped with the fix; I've verified that the changes
>> are in the source code.
>>
>> We don't want to have to downgrade SLURM to 20.02.x, but it seems that this
>> behaviour still exists. Are no other sites on fresh installs of >= SLURM
>> 20.11.3 experiencing this problem?
>>
>> I was aware of the changes in 20.11.{0..2}, which received a lot of
>> scrutiny, which is why 20.11.3 was selected.
>>
>> Thanks,
>> John DeSantis
>>
>> On 4/26/21 5:12 PM, John DeSantis wrote:
>>> Hello all,
>>>
>>> We've recently (don't laugh!) updated two of our SLURM installations from
>>> 16.05.10-2 to 20.11.3 and 17.11.9, respectively. Now, on the newer version
>>> (20.11.3), OpenMPI doesn't seem to function in interactive mode across
>>> multiple nodes as it did previously; using `srun` and `mpirun` on a single
>>> node gives the desired results, while using multiple nodes causes a hang.
>>> Jobs submitted via `sbatch` do _work as expected_.
>>>
>>> [desantis@sclogin0 ~]$ scontrol show config |grep VERSION; srun -n 2 -N 2-2
>>> -t 00:05:00 --pty /bin/bash
>>> SLURM_VERSION = 17.11.9
>>> [desantis@sccompute0 ~]$ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4
>>> mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6
>>> compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun;
>>> mpirun hostname; module purge; echo; done
>>> /apps/openmpi/1.8.5/bin/mpirun
>>> sccompute0
>>> sccompute1
>>>
>>> /apps/openmpi/2.0.4/bin/mpirun
>>> sccompute1
>>> sccompute0
>>>
>>> /apps/openmpi/2.0.4-psm2/bin/mpirun
>>> sccompute1
>>> sccompute0
>>>
>>> /apps/openmpi/2.1.6/bin/mpirun
>>> sccompute0
>>> sccompute1
>>>
>>> /apps/openmpi/3.1.6/bin/mpirun
>>> sccompute0
>>> sccompute1
>>>
>>> /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
>>> sccompute1
>>> sccompute0
>>>
>>>
>>> 15:58:28 Mon Apr 26 <0>
>>> desantis@itn0
>>> [~] $ scontrol show config|grep VERSION; srun -n 2 -N 2-2 --qos=devel
>>> --partition=devel -t 00:05:00 --pty /bin/bash
>>> SLURM_VERSION = 20.11.3
>>> srun: job 1019599 queued and waiting for resources
>>> srun: job 1019599 has been allocated resources
>>> 15:58:46 Mon Apr 26 <0>
>>> desantis@mdc-1057-30-1
>>> [~] $ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4
>>> mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6
>>> compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun;
>>> mpirun hostname; module purge; echo; done
>>> /apps/openmpi/1.8.5/bin/mpirun
>>> ^C
>>> /apps/openmpi/2.0.4/bin/mpirun
>>> ^C
>>> /apps/openmpi/2.0.4-psm2/bin/mpirun
>>> ^C
>>> /apps/openmpi/2.1.6/bin/mpirun
>>> ^C
>>> /apps/openmpi/3.1.6/bin/mpirun
>>> ^C
>>> /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
>>> ^C[mpiexec@mdc-1057-30-1] Sending Ctrl-C to processes as requested
>>> [mpiexec@mdc-1057-30-1] Press Ctrl-C again to force abort
>>> ^C
>>>
>>> Our SLURM installations are fairly straightforward. We `rpmbuild` directly
>>> from the bzip2 files without any additional arguments. We've done this
>>> since we first started using SLURM with version 14.03.3-2, and through all
>>> upgrades. Due to SLURM's awesomeness(!), we've simply reused the same
>>> configuration files between version changes, with the only changes being
>>> made to parameters which have been deprecated/renamed. Our
>>> "Mpi{Default,Params}" have always been set to "none". The only real
>>> difference we're able to ascertain is that the MPI plugin for openmpi has
>>> been removed.
>>>
>>> svc-3024-5-2: SLURM_VERSION = 16.05.10-2
>>> svc-3024-5-2: srun: MPI types are...
>>> svc-3024-5-2: srun: mpi/openmpi
>>> svc-3024-5-2: srun: mpi/mpich1_shmem
>>> svc-3024-5-2: srun: mpi/mpichgm
>>> svc-3024-5-2: srun: mpi/mvapich
>>> svc-3024-5-2: srun: mpi/mpich1_p4
>>> svc-3024-5-2: srun: mpi/lam
>>> svc-3024-5-2: srun: mpi/none
>>> svc-3024-5-2: srun: mpi/mpichmx
>>> svc-3024-5-2: srun: mpi/pmi2
>>>
>>> viking: SLURM_VERSION = 20.11.3
>>> viking: srun: MPI types are...
>>> viking: srun: cray_shasta
>>> viking: srun: pmi2
>>> viking: srun: none
>>>
>>> sclogin0: SLURM_VERSION = 17.11.9
>>> sclogin0: srun: MPI types are...
>>> sclogin0: srun: openmpi
>>> sclogin0: srun: none
>>> sclogin0: srun: pmi2
>>> sclogin0:
>>>
>>> As far as building OpenMPI, we've always withheld any SLURM-specific flags,
>>> i.e. "--with-slurm", although SLURM is detected during the build process.
>>>
>>> Because OpenMPI was always built using this method, we never had to
>>> recompile OpenMPI after subsequent SLURM upgrades, and no cluster-ready
>>> applications had to be rebuilt. The only time OpenMPI had to be rebuilt
>>> was for OPA hardware, which was a simple addition of the "--with-psm2"
>>> flag.
>>>
>>> It is my understanding that the openmpi plugin "never really did anything"
>>> (per perusing the mailing list), which is why it was removed. Furthermore,
>>> searching the mailing list suggests that the appropriate method is to use
>>> `salloc` first, despite version 17.11.9 not needing `salloc` for an
>>> "interactive" session.
>>>
>>> Before we go further down this rabbit hole, were other sites affected by
>>> the transition from SLURM versions 16.x, 17.x, 18.x(?) to versions 20.x?
>>> If so, did the methodology for multinode interactive MPI sessions change?
>>>
>>> Thanks!
>>> John DeSantis
>>>
>>
>
> --
> GPG A997BA7A | 87FC DA31 5F00 C885 0DC3 E28F BD0D 4B33 A997 BA7A
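
P.S. For reference, a rough sketch of the `salloc`-first approach mentioned
above, in case it helps anyone hitting the same hang. This is only a sketch,
not something we have standardized on: the QOS, partition, and module names
are simply the ones from the transcripts earlier in this thread, and where the
shell spawned by `salloc` actually runs (login node vs. first compute node)
depends on site configuration.

$ salloc -N 2 -n 2 --qos=devel --partition=devel -t 00:15:00   # obtain the allocation first
$ module load mpi/openmpi/3.1.6                                # same module as in the tests above
$ mpirun -n $SLURM_NTASKS hostname                             # should report both allocated nodes
$ exit                                                         # release the allocation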