A couple of things to try to locate the issue:

1. To check whether PMI itself is working: have you tried running something simple, like hello_world ( https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c ) and ring ( https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c )? Please run those two and post the results (a rough sketch of such a test is included below for reference).

2. If hello works but ring does not, try switching the fabric to TCP:

$ export OMPI_MCA_btl=tcp,self
$ export OMPI_MCA_pml=ob1
$ srun ...
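For reference, here is a rough sketch of such a minimal test (an illustration only, not the canonical hello_c.c/ring_c.c from the links above; the file name check_pmi.c is just a placeholder). It prints each rank's rank/size, so you can see whether PMI2 is handing Open MPI sane values, and then does a single MPI_Bcast, which is where your traces below show the failure:

/* check_pmi.c -- compile with: mpicc check_pmi.c -o check_pmi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        token = 42;   /* arbitrary payload */

    /* Your logs show the job aborting inside MPI_Bcast, so exercise it explicitly. */
    MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d got token %d\n", rank, size, token);

    MPI_Finalize();
    return 0;
}

If the printed size/rank values look wrong, the problem is on the PMI side; if they look right and only the MPI_Bcast step blows up, that points at the openib/verbs fabric rather than PMI.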
Please provide the outputs.

2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>
>
> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> also please post the output of
> $ srun --mpi=list
>
>
> [gwolosh@p-slogin bin]$ srun --mpi=list
> srun: MPI types are...
> srun: mpi/mpich1_shmem
> srun: mpi/mpich1_p4
> srun: mpi/lam
> srun: mpi/openmpi
> srun: mpi/none
> srun: mpi/mvapich
> srun: mpi/mpichmx
> srun: mpi/pmi2
> srun: mpi/mpichgm
>
>
> When the job crashes, are there any error messages in the relevant
> slurmd.log's or output on the screen?
>
>
> On screen:
>
> [snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> [snode4:5175] *** An error occurred in MPI_Bcast
> [snode4:5175] *** reported by process [17956865,24]
> [snode4:5175] *** on communicator MPI_COMM_WORLD
> [snode4:5175] *** MPI_ERR_OTHER: known error not in list
> [snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5175] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
> [snode4:5182] *** An error occurred in MPI_Bcast
> [snode4:5182] *** reported by process [17956865,31]
> [snode4:5182] *** on communicator MPI_COMM_WORLD
> [snode4:5182] *** MPI_ERR_OTHER: known error not in list
> [snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5182] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> [snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> [snode4:5178] *** An error occurred in MPI_Bcast
> [snode4:5178] *** reported by process [17956865,27]
> [snode4:5178] *** on communicator MPI_COMM_WORLD
> [snode4:5178] *** MPI_ERR_OTHER: known error not in list
> [snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5178] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> srun: error: snode4: tasks 24,31: Exited with exit code 16
> srun: error: snode4: tasks 25-30: Killed
> srun: error: snode5: tasks 32-39: Killed
> srun: error: snode3: tasks 16-23: Killed
> srun: error: snode8: tasks 56-63: Killed
> srun: error: snode7: tasks 48-55: Killed
> srun: error: snode1: tasks 0-7: Killed
> srun: error: snode2: tasks 8-15: Killed
> srun: error: snode6: tasks 40-47: Killed
>
> Nothing striking in the slurmd log.
>
>
>
> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
>
>> Hello,
>>
>> what is the value of the MpiDefault option in your Slurm configuration file?
>>
>> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>
>>> Hello
>>>
>>> This is using Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).
>>>
>>> [gwolosh@p-slogin bin]$ module li
>>>
>>> Currently Loaded Modules:
>>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26
>>>   4) numactl/2.0.11       5) hwloc/1.11.3         6) OpenMPI/1.10.3
>>>
>>> If I run
>>>
>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>>>
>>> it runs successfully, but I get a message:
>>>
>>> PMI2 initialized but returned bad values for size/rank/jobid.
>>> This is symptomatic of either a failure to use the
>>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>>> If running under SLURM, try adding "-mpi=pmi2" to your
>>> srun command line. If that doesn't work, or if you are
>>> not running under SLURM, try removing or renaming the
>>> pmi2.h header file so PMI2 support will not automatically
>>> be built, reconfigure and build OMPI, and then try again
>>> with only PMI1 support enabled.
>>>
>>> If I run
>>>
>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>>
>>> the job crashes.
>>>
>>> If I run via sbatch:
>>>
>>> #!/bin/bash
>>> # Job name:
>>> #SBATCH --job-name=nas_bench
>>> #SBATCH --nodes=8
>>> #SBATCH --ntasks=64
>>> #SBATCH --ntasks-per-node=8
>>> #SBATCH --time=48:00:00
>>> #SBATCH --output=nas.out.1
>>> #
>>> ## Command(s) to run (example):
>>> module use $HOME/easybuild/modules/all/Core
>>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>>> mpirun -np 64 ./ep.C.64
>>>
>>> the job crashes.
>>>
>>> Using EasyBuild, these are my config options for OMPI:
>>>
>>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>>> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
>>> configopts += '--disable-dlopen '  # statically link components, don't do dynamic loading
>>> configopts += '--with-slurm --with-pmi '
>>>
>>> And finally:
>>>
>>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>> libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>> libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>>
>>> $ ompi_info | grep pmi
>>>     MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>>     MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>>     MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>     MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>
>>> Any suggestions?
>>>
>>> _______________
>>> Gedaliah Wolosh
>>> IST Academic and Research Computing Systems (ARCS)
>>> NJIT
>>> GITC 2203
>>> 973 596 5437
>>> gwol...@njit.edu
>>>
>>>
>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>
>

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov