Hi Em,

This is most probably because the behaviour of srun changed in Slurm
version 20.11: job steps are no longer allowed to overlap by default.

An interactive job launched with `srun --pty bash´ always creates a
regular step (step <jobid>.0), so an mpirun or srun started from within
this interactive step will hang when it tries to launch another job
step, because the two steps would overlap.

You could try adding the --overlap flag to your srun calls, or run
`export SLURM_OVERLAP=1´ before launching your interactive job, to
revert to the previous behaviour that allowed steps to overlap.
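
For example, a minimal sketch (the allocation options just mirror the
ones from the mail below):

  export SLURM_OVERLAP=1               # let job steps overlap again
  srun -N 2 -n 4 --mem=4gb --pty bash  # interactive step <jobid>.0
  mpirun -n 4 ~/prime-mpi              # should no longer hang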

However, instead of using `srun --pty bash´ to launch interactive jobs,
it is now recommended to use `salloc´ with
`LaunchParameters=use_interactive_step´ set in slurm.conf.

`salloc´ with `LaunchParameters=use_interactive_step´ enabled will
create a special interactive step (step <jobid>.interactive) that does
not consume any resources and therefore does not interfere with job
steps launched from within it.
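
For example, a rough sketch of that setup (option values are again only
illustrative):

  # in slurm.conf:
  LaunchParameters=use_interactive_step

  # then, to get an interactive shell:
  salloc -N 2 -n 4 --mem=4gb   # shell runs in step <jobid>.interactive
  mpirun -n 4 ~/prime-mpi      # launches its own step without overlapping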

Hope this helps.

Best regards
Jürgen


* Em Dragowsky <dragow...@case.edu> [221102 15:46]:
> Greetings --
> 
> When we started using Slurm some years ago, obtaining interactive
> resources through "srun ... --pty bash" was the standard we adopted.
> We are now running Slurm v22.05 (happily), though we recently noticed
> some limitations when claiming resources to demonstrate or develop in
> an MPI environment. A colleague today was revisiting a finding dating
> back to January, which is:
> 
> > I am having issues running interactive MPI jobs in a traditional way. It
> > just stays there without execution.
> >
> > srun -N 2 -n 4 --mem=4gb --pty bash
> > mpirun -n 4 ~/prime-mpi
> >
> > However, it does run with:
> > srun -N 2 -n 4 --mem=4gb ~/prime-mpi
> >
> 
> As indicated, the first approach, claiming resources to test/demo MPI
> jobs via "srun ... --pty bash", no longer supports launching the job.
> We also checked the srun environment with increased verbosity and
> found that the job steps execute and terminate before the prompt
> appears in the requested shell.
> 
> While we infer that changes were implemented, would someone be able to
> direct us to documentation or a discussion of the changes and the
> motivation? We do not doubt that there is compelling motivation; we
> ask only to improve our understanding. As was summarized and shared
> amongst our team following our review of the current operational
> behaviour:
> 
> >
> >    - "srun ... executable" works fine
> >    - "salloc -n4", "ssh <node>", "srun -n4 <executable>" works
> >    Using "mpirun -n4 <executable>" does not work
> >    - In batch mode, both mpirun and srun work.
> >
> Thanks to any and all who take the time to shed light on this matter.
> 
> 
> -- 
> E.M. (Em) Dragowsky, Ph.D.
> Research Computing -- UTech
> Case Western Reserve University
> (216) 368-0082
> they/them
