> On Dec 7, 2017, at 1:18 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> A couple of things to try to locate the issue:
>
> 1. To check whether PMI itself is the problem: have you tried running something
> simple, like hello_c (https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c)
> and ring_c (https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c)?
> Please run those two and post the results.
> 2. If hello is working and ring is not, can you try changing the fabric to TCP:
> $ export OMPI_MCA_btl=tcp,self
> $ export OMPI_MCA_pml=ob1
> $ srun ...
>
> Please provide the outputs
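For reference, a minimal sketch of the two suggestions combined into one invocation; the binary name, node count, and task layout are only placeholders borrowed from the commands elsewhere in this thread:

$ export OMPI_MCA_btl=tcp,self
$ export OMPI_MCA_pml=ob1
$ srun --mpi=pmi2 --nodes=2 --ntasks-per-node=8 --ntasks=16 ./ring_c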
srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out

Hello, world, I am 24 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 0 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 25 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 1 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 27 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 2 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 29 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 31 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 30 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 4 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 5 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 17 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 3 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 7 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 6 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 18 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 22 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 23 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 19 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 9 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 20 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 8 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 10 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 13 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 11 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 26 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 16 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 14 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 28 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 21 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 15 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
Hello, world, I am 12 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)

srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 --nodes=2 ./ring_c > ring_c.out

Process 1 exiting
Process 12 exiting
Process 14 exiting
Process 13 exiting
Process 3 exiting
Process 11 exiting
Process 5 exiting
Process 6 exiting
Process 2 exiting
Process 4 exiting
Process 9 exiting
Process 10 exiting
Process 7 exiting
Process 15 exiting
Process 0 sending 10 to 1, tag 201 (16 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 8 exiting

>
> 2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>
>
>> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> also please post the output of
>> $ srun --mpi=list
>
> [gwolosh@p-slogin bin]$ srun --mpi=list
> srun: MPI types are...
> srun: mpi/mpich1_shmem
> srun: mpi/mpich1_p4
> srun: mpi/lam
> srun: mpi/openmpi
> srun: mpi/none
> srun: mpi/mvapich
> srun: mpi/mpichmx
> srun: mpi/pmi2
> srun: mpi/mpichgm
>
>
>>
>> When the job crashes, are there any error messages in the relevant slurmd.log's or output on the screen?
>
> on screen —
>
> [snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> [snode4:5175] *** An error occurred in MPI_Bcast
> [snode4:5175] *** reported by process [17956865,24]
> [snode4:5175] *** on communicator MPI_COMM_WORLD
> [snode4:5175] *** MPI_ERR_OTHER: known error not in list
> [snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5175] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
> [snode4:5182] *** An error occurred in MPI_Bcast
> [snode4:5182] *** reported by process [17956865,31]
> [snode4:5182] *** on communicator MPI_COMM_WORLD
> [snode4:5182] *** MPI_ERR_OTHER: known error not in list
> [snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5182] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> [snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
> [snode4:5178] *** An error occurred in MPI_Bcast
> [snode4:5178] *** reported by process [17956865,27]
> [snode4:5178] *** on communicator MPI_COMM_WORLD
> [snode4:5178] *** MPI_ERR_OTHER: known error not in list
> [snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [snode4:5178] *** and potentially your MPI job)
> mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
> srun: error: snode4: tasks 24,31: Exited with exit code 16
> srun: error: snode4: tasks 25-30: Killed
> srun: error: snode5: tasks 32-39: Killed
> srun: error: snode3: tasks 16-23: Killed
> srun: error: snode8: tasks 56-63: Killed
> srun: error: snode7: tasks 48-55: Killed
> srun: error: snode1: tasks 0-7: Killed
> srun: error: snode2: tasks 8-15: Killed
> srun: error: snode6: tasks 40-47: Killed
>
> Nothing striking in the slurmd log
>
>
>>
>> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
>> Hello,
>>
>> what is the value of the MpiDefault option in your Slurm configuration file?
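(One quick way to answer this question without opening slurm.conf by hand is sketched below; the command is standard Slurm, but the value on this cluster is not shown anywhere in the thread:)

$ scontrol show config | grep -i MpiDefault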
>>
>> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>> Hello
>>
>> This is using Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen)
>>
>> [gwolosh@p-slogin bin]$ module li
>>
>> Currently Loaded Modules:
>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26   4) numactl/2.0.11   5) hwloc/1.11.3   6) OpenMPI/1.10.3
>>
>> If I run
>>
>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>>
>> it runs successfully but I get a message —
>>
>> PMI2 initialized but returned bad values for size/rank/jobid.
>> This is symptomatic of either a failure to use the
>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>> If running under SLURM, try adding "--mpi=pmi2" to your
>> srun command line. If that doesn't work, or if you are
>> not running under SLURM, try removing or renaming the
>> pmi2.h header file so PMI2 support will not automatically
>> be built, reconfigure and build OMPI, and then try again
>> with only PMI1 support enabled.
>>
>> If I run
>>
>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>
>> the job crashes
>>
>> If I run via sbatch —
>>
>> #!/bin/bash
>> # Job name:
>> #SBATCH --job-name=nas_bench
>> #SBATCH --nodes=8
>> #SBATCH --ntasks=64
>> #SBATCH --ntasks-per-node=8
>> #SBATCH --time=48:00:00
>> #SBATCH --output=nas.out.1
>> #
>> ## Command(s) to run (example):
>> module use $HOME/easybuild/modules/all/Core
>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>> mpirun -np 64 ./ep.C.64
>>
>> the job crashes
>>
>> Using easybuild, these are my config options for ompi —
>>
>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
>> configopts += '--disable-dlopen '  # statically link component, don't do dynamic loading
>> configopts += '--with-slurm --with-pmi '
>>
>> And finally —
>>
>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>         libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>         libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>
>> $ ompi_info | grep pmi
>>                  MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>                 MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>             MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>              MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>
>> Any suggestions?
>> _______________
>> Gedaliah Wolosh
>> IST Academic and Research Computing Systems (ARCS)
>> NJIT
>> GITC 2203
>> 973 596 5437
>> gwol...@njit.edu
>
>
> --
> Best regards, Artem Y. Polyakov