Hi Chris,

My speculation (since I don't have perms to use --no-alloc myself) is that SLURM_JOBID is not set, but SLURM_NODELIST may be, so it's confusing ORTE. Could you list which SLURM env variables are set in the shell in which you're running the srun command?
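For example, something like "env | grep '^SLURM_'" in that shell, or a small Python helper along these lines (just a sketch, and slurm_env_dump.py is a made-up name), run both plain and under srun --no-alloc so the two environments can be compared:

    # slurm_env_dump.py -- hypothetical helper: print every SLURM_* variable
    # visible to the process, so the plain-shell and srun --no-alloc
    # environments can be compared side by side.
    import os
    import socket

    host = socket.gethostname()
    slurm_vars = sorted((k, v) for k, v in os.environ.items()
                        if k.startswith("SLURM_"))
    if not slurm_vars:
        print("%s: no SLURM_* variables set" % host)
    for name, value in slurm_vars:
        print("%s: %s=%s" % (host, name, value))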
Howard

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of "O'Grady, Paul Christopher" <c...@slac.stanford.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Monday, March 8, 2021 at 2:09 PM
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Subject: [EXTERNAL] [slurm-users] --no-alloc breaks mpi?

Hi,

I'm having an issue with srun's --no-alloc flag with mpi, which I can reproduce with a fairly simple example. When I run a simple one-core mpi test program as "slurmUser" (the account that has the --no-alloc privilege) it succeeds:

srun -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py

However, when I add the --no-alloc flag it fails in a way that appears to break mpi (see logfile output and other slurm/mpi info below). It fails similarly on 2 cores.

srun --no-alloc -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py
srun: do not allocate resources
srun: error: psana1507: task 0: Exited with exit code 1

Would anyone have any suggestions for how I could make the --no-alloc flag work with mpi? Thanks!

chris

------------------------------------------------------------------------------------------------------
Logfile error with --no-alloc flag:

(ana-4.0.12) psanagpu105:batchtest_slurm$ more logs/test.log
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[psana1507:13884] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(ana-4.0.12) psanagpu105:batchtest_slurm$

System information:

(ana-4.0.12) psanagpu105:batchtest_slurm$ conda list | grep mpi
mpi                       1.0                      openmpi          conda-forge
mpi4py                    3.0.3            py27h9ab638b_1          conda-forge
openmpi                   4.1.0                h9b22176_1          conda-forge
(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --version
slurm 20.11.3
(ana-4.0.12) psanagpu105:batchtest_slurm$
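For reference, the failing script itself isn't included in the thread; a minimal mpi4py test along the following lines would exercise the same MPI_Init_thread path that aborts in the log (this is only an assumed stand-in for mpi_simpletest.py, not the actual file):

    # Minimal stand-in for ~/ipsana/mpi_simpletest.py (the real script is not
    # shown in this thread).  Importing mpi4py.MPI calls MPI_Init_thread,
    # which is where the --no-alloc run aborts.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print("rank %d of %d on %s"
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

Running a script like this with and without --no-alloc, as in the srun commands above, should reproduce the difference.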