Given that ring is working, I don't think it's a PMI problem. Can you try running NPB with the TCP BTL parameters I provided earlier (restated below)? I'm assuming you have a TCP interconnect; let me know if that's not the case.
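A minimal sketch of that run, assuming bash and the same ep.C.64 NPB binary and task layout from your earlier srun attempts (adjust the counts to taste):

$ export OMPI_MCA_btl=tcp,self   # bypass the openib (verbs) BTL and use TCP
$ export OMPI_MCA_pml=ob1        # keep the ob1 point-to-point layer
$ srun --mpi=pmi2 --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64

If this runs cleanly, that would point at the openib/verbs path rather than at PMI2.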
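Separately, since MpiDefault came up earlier in the thread, a quick way to check it without digging through slurm.conf (scontrol ships with Slurm; the grep is just illustrative):

$ scontrol show config | grep -i MpiDefault

If it reports none, srun needs the explicit --mpi=pmi2 flag (or MpiDefault=pmi2 in slurm.conf) for a PMI2-built Open MPI, which matches the warning you saw on the run without the flag.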
On Thu, Dec 7, 2017 at 12:03, Glenn (Gedaliah) Wolosh <gwol...@njit.edu> wrote:

> On Dec 7, 2017, at 1:18 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> A couple of things to try to locate the issue:
>
> 1. To check whether PMI is working: have you tried to run something simple, like hello_c (https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c) and ring_c (https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c)? Please try to run those two and post the results.
> 2. If hello is working and ring is not, can you try to change the fabric to TCP:
> $ export OMPI_MCA_btl=tcp,self
> $ export OMPI_MCA_pml=ob1
> $ srun ...
>
> Please provide the outputs.
>
> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out
>
> Hello, world, I am 24 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 0 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 25 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 1 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 27 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 2 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 29 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 31 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 30 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 4 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 5 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 17 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 3 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 7 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 6 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 18 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 22 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 23 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 19 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 9 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 20 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 8 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 10 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 13 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 11 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 26 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 16 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 14 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 28 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 21 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 15 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 12 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>
> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 --nodes=2 ./ring_c > ring_c.out
>
> Process 1 exiting
> Process 12 exiting
> Process 14 exiting
> Process 13 exiting
> Process 3 exiting
> Process 11 exiting
> Process 5 exiting
> Process 6 exiting
> Process 2 exiting
> Process 4 exiting
> Process 9 exiting
> Process 10 exiting
> Process 7 exiting
> Process 15 exiting
> Process 0 sending 10 to 1, tag 201 (16 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 8 exiting
>
> 2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>
>> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> Also, please post the output of
>> $ srun --mpi=list
>>
>> [gwolosh@p-slogin bin]$ srun --mpi=list
>> srun: MPI types are...
>> srun: mpi/mpich1_shmem
>> srun: mpi/mpich1_p4
>> srun: mpi/lam
>> srun: mpi/openmpi
>> srun: mpi/none
>> srun: mpi/mvapich
>> srun: mpi/mpichmx
>> srun: mpi/pmi2
>> srun: mpi/mpichgm
>>
>> When the job crashes, are there any error messages in the relevant slurmd.logs or output on the screen?
>>
>> On screen:
>>
>> [snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> [snode4:5175] *** An error occurred in MPI_Bcast
>> [snode4:5175] *** reported by process [17956865,24]
>> [snode4:5175] *** on communicator MPI_COMM_WORLD
>> [snode4:5175] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5175] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
>> [snode4:5182] *** An error occurred in MPI_Bcast
>> [snode4:5182] *** reported by process [17956865,31]
>> [snode4:5182] *** on communicator MPI_COMM_WORLD
>> [snode4:5182] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5182] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> [snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> [snode4:5178] *** An error occurred in MPI_Bcast
>> [snode4:5178] *** reported by process [17956865,27]
>> [snode4:5178] *** on communicator MPI_COMM_WORLD
>> [snode4:5178] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5178] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> srun: error: snode4: tasks 24,31: Exited with exit code 16
>> srun: error: snode4: tasks 25-30: Killed
>> srun: error: snode5: tasks 32-39: Killed
>> srun: error: snode3: tasks 16-23: Killed
>> srun: error: snode8: tasks 56-63: Killed
>> srun: error: snode7: tasks 48-55: Killed
>> srun: error: snode1: tasks 0-7: Killed
>> srun: error: snode2: tasks 8-15: Killed
>> srun: error: snode6: tasks 40-47: Killed
>>
>> Nothing striking in the slurmd log.
>>
>> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
>>
>>> Hello,
>>>
>>> What is the value of the MpiDefault option in your Slurm configuration file?
>>>
>>> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>>
>>>> Hello,
>>>>
>>>> This is Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).
>>>>
>>>> [gwolosh@p-slogin bin]$ module li
>>>>
>>>> Currently Loaded Modules:
>>>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26
>>>>   4) numactl/2.0.11       5) hwloc/1.11.3         6) OpenMPI/1.10.3
>>>>
>>>> If I run
>>>>
>>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>>>>
>>>> it runs successfully, but I get a message:
>>>>
>>>> PMI2 initialized but returned bad values for size/rank/jobid.
>>>> This is symptomatic of either a failure to use the
>>>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>>>> If running under SLURM, try adding "--mpi=pmi2" to your
>>>> srun command line. If that doesn't work, or if you are
>>>> not running under SLURM, try removing or renaming the
>>>> pmi2.h header file so PMI2 support will not automatically
>>>> be built, reconfigure and build OMPI, and then try again
>>>> with only PMI1 support enabled.
>>>>
>>>> If I run
>>>>
>>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>>>
>>>> the job crashes.
>>>>
>>>> If I run via sbatch:
>>>>
>>>> #!/bin/bash
>>>> # Job name:
>>>> #SBATCH --job-name=nas_bench
>>>> #SBATCH --nodes=8
>>>> #SBATCH --ntasks=64
>>>> #SBATCH --ntasks-per-node=8
>>>> #SBATCH --time=48:00:00
>>>> #SBATCH --output=nas.out.1
>>>> #
>>>> ## Command(s) to run (example):
>>>> module use $HOME/easybuild/modules/all/Core
>>>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>>>> mpirun -np 64 ./ep.C.64
>>>>
>>>> the job crashes.
>>>>
>>>> Using EasyBuild, these are my configure options for Open MPI:
>>>>
>>>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>>>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>>>> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
>>>> configopts += '--disable-dlopen '  # statically link components, don't do dynamic loading
>>>> configopts += '--with-slurm --with-pmi '
>>>>
>>>> And finally:
>>>>
>>>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>>>     libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>>>     libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>>>
>>>> $ ompi_info | grep pmi
>>>>     MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>>>     MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>>>     MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>>     MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>>
>>>> Any suggestions?
>>>>
>>>> _______________
>>>> Gedaliah Wolosh
>>>> IST Academic and Research Computing Systems (ARCS)
>>>> NJIT
>>>> GITC 2203
>>>> 973 596 5437
>>>> gwol...@njit.edu
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
>>
>> --
>> Best regards, Artem Y. Polyakov
>
> --
> Best regards, Artem Y. Polyakov

--
Best regards, Artem Polyakov (Mobile mail)