> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> also please post the output of
> $ srun --mpi=list
[gwolosh@p-slogin bin]$ srun --mpi=list
srun: MPI types are...
srun: mpi/mpich1_shmem
srun: mpi/mpich1_p4
srun: mpi/lam
srun: mpi/openmpi
srun: mpi/none
srun: mpi/mvapich
srun: mpi/mpichmx
srun: mpi/pmi2
srun: mpi/mpichgm

> When the job crashes, are there any error messages in the relevant
> slurmd.log's or output on the screen?

On screen:

[snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
[snode4:5175] *** An error occurred in MPI_Bcast
[snode4:5175] *** reported by process [17956865,24]
[snode4:5175] *** on communicator MPI_COMM_WORLD
[snode4:5175] *** MPI_ERR_OTHER: known error not in list
[snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[snode4:5175] ***    and potentially your MPI job)
mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
[snode4:5182] *** An error occurred in MPI_Bcast
[snode4:5182] *** reported by process [17956865,31]
[snode4:5182] *** on communicator MPI_COMM_WORLD
[snode4:5182] *** MPI_ERR_OTHER: known error not in list
[snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[snode4:5182] ***    and potentially your MPI job)
mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
[snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
[snode4:5178] *** An error occurred in MPI_Bcast
[snode4:5178] *** reported by process [17956865,27]
[snode4:5178] *** on communicator MPI_COMM_WORLD
[snode4:5178] *** MPI_ERR_OTHER: known error not in list
[snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[snode4:5178] ***    and potentially your MPI job)
mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
srun: error: snode4: tasks 24,31: Exited with exit code 16
srun: error: snode4: tasks 25-30: Killed
srun: error: snode5: tasks 32-39: Killed
srun: error: snode3: tasks 16-23: Killed
srun: error: snode8: tasks 56-63: Killed
srun: error: snode7: tasks 48-55: Killed
srun: error: snode1: tasks 0-7: Killed
srun: error: snode2: tasks 8-15: Killed
srun: error: snode6: tasks 40-47: Killed

Nothing striking in the slurmd log.
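(For reference: the MpiDefault value asked about below can be read back from the running configuration, assuming scontrol works from a login node; the value itself depends on the site's slurm.conf.)

$ scontrol show config | grep -i MpiDefault

With MpiDefault = none, srun only sets up full PMI2 services when --mpi=pmi2 is passed explicitly, which is consistent with the Open MPI warning quoted further down.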
> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
> Hello,
>
> what is the value of MpiDefault option in your Slurm configuration file?
>
> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
> Hello
>
> This is Slurm version 17.02.6 running on Scientific Linux release 7.4
> (Nitrogen).
>
> [gwolosh@p-slogin bin]$ module li
>
> Currently Loaded Modules:
>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26
>   4) numactl/2.0.11       5) hwloc/1.11.3         6) OpenMPI/1.10.3
>
> If I run
>
> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>
> it runs successfully, but I get a message:
>
> PMI2 initialized but returned bad values for size/rank/jobid.
> This is symptomatic of either a failure to use the
> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
> If running under SLURM, try adding "-mpi=pmi2" to your
> srun command line. If that doesn't work, or if you are
> not running under SLURM, try removing or renaming the
> pmi2.h header file so PMI2 support will not automatically
> be built, reconfigure and build OMPI, and then try again
> with only PMI1 support enabled.
>
> If I run
>
> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>
> the job crashes.
>
> If I run via sbatch:
>
> #!/bin/bash
> # Job name:
> #SBATCH --job-name=nas_bench
> #SBATCH --nodes=8
> #SBATCH --ntasks=64
> #SBATCH --ntasks-per-node=8
> #SBATCH --time=48:00:00
> #SBATCH --output=nas.out.1
> #
> ## Command(s) to run (example):
> module use $HOME/easybuild/modules/all/Core
> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
> mpirun -np 64 ./ep.C.64
>
> the job crashes.
>
> Using EasyBuild, these are my config options for OMPI:
>
> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
> configopts += '--disable-dlopen '  # statically link components, don't do dynamic loading
> configopts += '--with-slurm --with-pmi '
>
> And finally:
>
> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>         libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>         libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>
> $ ompi_info | grep pmi
>         MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>         MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>         MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>         MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>
> Any suggestions?
> _______________
> Gedaliah Wolosh
> IST Academic and Research Computing Systems (ARCS)
> NJIT
> GITC 2203
> 973 596 5437
> gwol...@njit.edu
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
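(A generic way to narrow down the verbs/UDCM failures above is to rerun the PMI2 launch with Open MPI's openib BTL excluded, so the job falls back to TCP and shared memory: if that run completes, the problem sits in the InfiniBand transport rather than in the Slurm/PMI2 integration. The script below is only a sketch along those lines, reusing the modules and binary from the original post; the job name, walltime, and output file are placeholders.)

#!/bin/bash
#SBATCH --job-name=nas_bench_dbg
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=nas.debug.out

module use $HOME/easybuild/modules/all/Core
module load GCC/5.4.0-2.26 OpenMPI/1.10.3

# Exclude the openib BTL (equivalent to "mpirun --mca btl ^openib");
# Open MPI then falls back to its remaining transports (e.g. self, sm, tcp).
export OMPI_MCA_btl="^openib"

# Launch through Slurm's PMI2 plugin rather than mpirun
srun --mpi=pmi2 ./ep.C.64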