Hi, Juergen -- This is really useful information -- thanks for the pointer, and for taking the time to share!
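For the benefit of the list, here is the quick workaround we plan to test first, based on your note -- untested on our side as yet, and simply reusing the resource flags and the ~/prime-mpi binary from our original report:

    # allow the interactive step and the MPI step to overlap
    $ export SLURM_OVERLAP=1
    $ srun -N 2 -n 4 --mem=4gb --pty bash

    # then, from within the interactive shell:
    $ mpirun -n 4 ~/prime-mpi

If that behaves as described, it should tide us over until we can adopt the salloc-based approach.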
And, Jacob -- can you point us to any primary documentation for Juergen's observation that this change took place in v20.11? With the emphasis on salloc, I find in the examples:

> To get an allocation, and open a new xterm in which srun commands
> may be typed interactively:
>
> $ salloc -N16 xterm
> salloc: Granted job allocation 65537

which works as advertised (I'm not sure whether I miss xterms or not -- at least on our cluster we don't configure them explicitly as a primary terminal tool).

And thanks also to Chris and Jason for the validation and endorsement of these approaches.
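For completeness, here is the salloc-based recipe we expect to try once `LaunchParameters=use_interactive_step´ is enabled in our slurm.conf, per Juergen's note -- an untested sketch on our side, again reusing the flags from the original report:

    # slurm.conf (admin-side setting, as recommended below)
    LaunchParameters=use_interactive_step

    # request the allocation; salloc should then open a shell in the
    # special <jobid>.interactive step on the first allocated node
    $ salloc -N 2 -n 4 --mem=4gb

    # launch the MPI job as a new step from inside that shell
    $ srun -n 4 ~/prime-mpi      # or: mpirun -n 4 ~/prime-mpi

If that matches what others are doing in practice, we will likely adopt it as our documented interactive workflow.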
Best, all!
~ Em

On Wed, Nov 2, 2022 at 5:47 PM Juergen Salk <juergen.s...@uni-ulm.de> wrote:

> Hi Em,
>
> this is most probably because in Slurm version 20.11 the behaviour of srun
> was changed to no longer allow job steps to overlap by default.
>
> An interactive job launched by `srun --pty bash´ always creates a regular
> step (step <jobid>.0), so mpirun or srun will hang when trying to launch
> another job step from within this interactive job step, as they would
> overlap.
>
> You could try using the --overlap flag or `export SLURM_OVERLAP=1´
> before running your interactive job to revert to the previous behaviour
> that allows steps to overlap.
>
> However, instead of using `srun --pty bash´ for launching interactive
> jobs, it is now recommended to use `salloc´ and have
> `LaunchParameters=use_interactive_step´ set in slurm.conf.
>
> `salloc´ with `LaunchParameters=use_interactive_step´ enabled will
> create a special interactive step (step <jobid>.interactive) that does not
> consume any resources and, thus, does not interfere with a new job step
> launched from within this special interactive job step.
>
> Hope this helps.
>
> Best regards
> Jürgen
>
>
> * Em Dragowsky <dragow...@case.edu> [221102 15:46]:
> > Greetings --
> >
> > When we started using Slurm some years ago, obtaining interactive
> > resources through "srun ... --pty bash" was the standard that we
> > adopted. We are now running Slurm v22.05 (happily), though we noticed
> > recently some limitations when claiming resources to demonstrate or
> > develop in an MPI environment. A colleague today was revisiting a
> > finding dating back to January, which is:
> >
> > > I am having issues running interactive MPI jobs in a traditional way.
> > > It just stays there without execution.
> > >
> > > srun -N 2 -n 4 --mem=4gb --pty bash
> > > mpirun -n 4 ~/prime-mpi
> > >
> > > However, it does run with:
> > > srun -N 2 -n 4 --mem=4gb ~/prime-mpi
> >
> > As indicated, the first approach, taking the resources to test/demo MPI
> > jobs via "srun ... --pty bash", no longer supports launching the job.
> > We also checked the srun environment using verbosity, and found that
> > the job steps are executed and terminate before the prompt appears in
> > the requested shell.
> >
> > While we infer that changes were implemented, would someone be able to
> > direct us to documentation or a discussion of the changes and the
> > motivation? We do not doubt that there is compelling motivation; we ask
> > to improve our understanding. As was summarized and shared amongst our
> > team following our review of the current operational behaviour:
> >
> > - "srun ... <executable>" works fine
> > - "salloc -n4", "ssh <node>", "srun -n4 <executable>" works;
> >   using "mpirun -n4 <executable>" does not work
> > - In batch mode, both mpirun and srun work.
> >
> > Thanks to any and all who take the time to shed light on this matter.
> >
> > --
> > E.M. (Em) Dragowsky, Ph.D.
> > Research Computing -- UTech
> > Case Western Reserve University
> > (216) 368-0082
> > they/them

--
E.M. (Em) Dragowsky, Ph.D.
Research Computing -- UTech
Case Western Reserve University
(216) 368-0082
they/them