You seem to be using a very old OMPI release (the current one is 3.0), so I'd suggest trying a newer version if you can. This also looks like a pure OMPI problem, so the OMPI dev list may be a more appropriate place for this topic.
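If it helps, here is a quick way to double-check which OMPI build and PMI support srun is actually picking up, and to force the TCP/ob1 path (suggested further down in the thread) for a single run without exporting the variables into your whole shell. This is just a sketch: it assumes a bash shell and that srun propagates the caller's environment to the tasks, which is Slurm's default behavior.

$ # Confirm the OMPI version and which PMI components were compiled in
$ ompi_info --version
$ ompi_info | grep -i pmi
$
$ # Force the TCP BTL and the ob1 PML for this one job step only
$ OMPI_MCA_btl=tcp,self OMPI_MCA_pml=ob1 srun --mpi=pmi2 --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64

If the TCP run is clean while the default (openib) run still crashes, that again points at the verbs/openib path in 1.10.x rather than at PMI2.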
2017-12-07 12:53 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:

>
> On Dec 7, 2017, at 3:26 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> Given that ring is working, I don't think it's a PMI problem.
>
> Can you try running NPB with the tcp btl parameters that I've provided?
> (I assume you have a TCP interconnect; let me know if that's not the case.)
>
>
> Thu, Dec 7, 2017 at 12:03, Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>
>> On Dec 7, 2017, at 1:18 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> A couple of things to try to locate the issue:
>>
>> 1. To check whether PMI is working: have you tried running something
>> simple, like hello_world (https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c)
>> and ring (https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c)?
>> Please run those two and post the results.
>> 2. If hello is working and ring is not, can you try to change the fabric
>> to TCP:
>> $ export OMPI_MCA_btl=tcp,self
>> $ export OMPI_MCA_pml=ob1
>> $ srun ...
>>
>> Please provide the outputs.
>>
>
> export OMPI_MCA_btl=tcp,self
> export OMPI_MCA_pml=ob1
>
> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>
> This works —
>
> NAS Parallel Benchmarks 3.3 -- EP Benchmark
>
> Number of random numbers generated: 8589934592
> Number of active processes: 64
>
> EP Benchmark Results:
>
> CPU Time = 5.9208
> N = 2^32
> No. Gaussian Pairs = 3373275903.
> Sums = 4.764367927992081D+04 -8.084072988045549D+04
> Counts:
>   0 1572172634.
>   1 1501108549.
>   2 281805648.
>   3 17761221.
>   4 424017.
>   5 3821.
>   6 13.
>   7 0.
>   8 0.
>   9 0.
>
>
> EP Benchmark Completed.
> Class           = C
> Size            = 8589934592
> Iterations      = 0
> Time in seconds = 5.92
> Total processes = 64
> Compiled procs  = 64
> Mop/s total     = 1450.82
> Mop/s/process   = 22.67
> Operation type  = Random numbers generated
> Verification    = SUCCESSFUL
> Version         = 3.3.1
> Compile date    = 07 Dec 2017
>
> Compile options:
>   MPIF77     = mpif77
>   FLINK      = $(MPIF77)
>   FMPI_LIB   = -L/opt/local/easybuild/software/Compiler/GC...
>   FMPI_INC   = -I/opt/local/easybuild/software/Compiler/GC...
>   FFLAGS     = -O
>   FLINKFLAGS = -O
>   RAND       = randi8
>
>
> Please send feedbacks and/or the results of this run to:
>
>   NPB Development Team
>   Internet: n...@nas.nasa.gov
>
> Hmm...
>
> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out
>>
>> Hello, world, I am 24 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 0 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 25 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 1 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 27 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 2 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 29 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 31 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 30 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 4 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 5 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 17 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 3 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 7 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 6 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 18 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 22 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 23 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 19 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 9 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 20 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 8 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 10 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 13 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 11 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 26 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 16 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 14 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 28 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 21 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 15 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>> Hello, world, I am 12 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>>
>> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 --nodes=2 ./ring_c > ring_c.out
>>
>> Process 1 exiting
>> Process 12 exiting
>> Process 14 exiting
>> Process 13 exiting
>> Process 3 exiting
>> Process 11 exiting
>> Process 5 exiting
>> Process 6 exiting
>> Process 2 exiting
>> Process 4 exiting
>> Process 9 exiting
>> Process 10 exiting
>> Process 7 exiting
>> Process 15 exiting
>> Process 0 sending 10 to 1, tag 201 (16 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 8 exiting
>>
>>
>> 2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>
>>>
>>> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>
>>> Also, please post the output of
>>> $ srun --mpi=list
>>>
>>>
>>> [gwolosh@p-slogin bin]$ srun --mpi=list
>>> srun: MPI types are...
>>> srun: mpi/mpich1_shmem
>>> srun: mpi/mpich1_p4
>>> srun: mpi/lam
>>> srun: mpi/openmpi
>>> srun: mpi/none
>>> srun: mpi/mvapich
>>> srun: mpi/mpichmx
>>> srun: mpi/pmi2
>>> srun: mpi/mpichgm
>>>
>>>
>>> When the job crashes, are there any error messages in the relevant
>>> slurmd.log's or output on the screen?
>>>
>>>
>>> On screen —
>>>
>>> [snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>>> [snode4:5175] *** An error occurred in MPI_Bcast
>>> [snode4:5175] *** reported by process [17956865,24]
>>> [snode4:5175] *** on communicator MPI_COMM_WORLD
>>> [snode4:5175] *** MPI_ERR_OTHER: known error not in list
>>> [snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [snode4:5175] *** and potentially your MPI job)
>>> mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> [snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>>> slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
>>> [snode4:5182] *** An error occurred in MPI_Bcast
>>> [snode4:5182] *** reported by process [17956865,31]
>>> [snode4:5182] *** on communicator MPI_COMM_WORLD
>>> [snode4:5182] *** MPI_ERR_OTHER: known error not in list
>>> [snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [snode4:5182] *** and potentially your MPI job)
>>> mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>>> [snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>>> [snode4:5178] *** An error occurred in MPI_Bcast
>>> [snode4:5178] *** reported by process [17956865,27]
>>> [snode4:5178] *** on communicator MPI_COMM_WORLD
>>> [snode4:5178] *** MPI_ERR_OTHER: known error not in list
>>> [snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [snode4:5178] *** and potentially your MPI job)
>>> mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>>> srun: error: snode4: tasks 24,31: Exited with exit code 16
>>> srun: error: snode4: tasks 25-30: Killed
>>> srun: error: snode5: tasks 32-39: Killed
>>> srun: error: snode3: tasks 16-23: Killed
>>> srun: error: snode8: tasks 56-63: Killed
>>> srun: error: snode7: tasks 48-55: Killed
>>> srun: error: snode1: tasks 0-7: Killed
>>> srun: error: snode2: tasks 8-15: Killed
>>> srun: error: snode6: tasks 40-47: Killed
>>>
>>> Nothing striking in the slurmd log.
>>>
>>>
>>> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> What is the value of the MpiDefault option in your Slurm configuration file?
>>>>
>>>> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>>>
>>>>> Hello
>>>>>
>>>>> This is Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).
>>>>>
>>>>> [gwolosh@p-slogin bin]$ module li
>>>>>
>>>>> Currently Loaded Modules:
>>>>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26
>>>>>   4) numactl/2.0.11       5) hwloc/1.11.3         6) OpenMPI/1.10.3
>>>>>
>>>>> If I run
>>>>>
>>>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>>>>>
>>>>> it runs successfully, but I get a message —
>>>>>
>>>>> PMI2 initialized but returned bad values for size/rank/jobid.
>>>>> This is symptomatic of either a failure to use the
>>>>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>>>>> If running under SLURM, try adding "-mpi=pmi2" to your
>>>>> srun command line. If that doesn't work, or if you are
>>>>> not running under SLURM, try removing or renaming the
>>>>> pmi2.h header file so PMI2 support will not automatically
>>>>> be built, reconfigure and build OMPI, and then try again
>>>>> with only PMI1 support enabled.
>>>>>
>>>>> If I run
>>>>>
>>>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>>>>
>>>>> the job crashes.
>>>>>
>>>>> If I run via sbatch —
>>>>>
>>>>> #!/bin/bash
>>>>> # Job name:
>>>>> #SBATCH --job-name=nas_bench
>>>>> #SBATCH --nodes=8
>>>>> #SBATCH --ntasks=64
>>>>> #SBATCH --ntasks-per-node=8
>>>>> #SBATCH --time=48:00:00
>>>>> #SBATCH --output=nas.out.1
>>>>> #
>>>>> ## Command(s) to run (example):
>>>>> module use $HOME/easybuild/modules/all/Core
>>>>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>>>>> mpirun -np 64 ./ep.C.64
>>>>>
>>>>> the job crashes.
>>>>>
>>>>> Using EasyBuild, these are my config options for OMPI —
>>>>>
>>>>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>>>>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>>>>> configopts += '--with-hwloc=$EBROOTHWLOC '  # hwloc support
>>>>> configopts += '--disable-dlopen '  # statically link components, don't do dynamic loading
>>>>> configopts += '--with-slurm --with-pmi '
>>>>>
>>>>> And finally —
>>>>>
>>>>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>>>>   libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>>>>   libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>>>>
>>>>> $ ompi_info | grep pmi
>>>>>   MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>>>>   MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>>>>   MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>>>   MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> _______________
>>>>> Gedaliah Wolosh
>>>>> IST Academic and Research Computing Systems (ARCS)
>>>>> NJIT
>>>>> GITC 2203
>>>>> 973 596 5437
>>>>> gwol...@njit.edu
>>>>>
>>>>
>>>> --
>>>> Best regards, Artem Y. Polyakov
>>>
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
>>>
>>
>>
>> --
>> Best regards, Artem Y. Polyakov
>>
> --
> -----
> Best regards, Artem Polyakov (Mobile mail)
>
>
--
Best regards, Artem Y. Polyakov
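For anyone who lands on this thread with the same symptoms, below is a minimal batch script that pulls together the pieces discussed above: the module setup from the original post, the TCP BTL/ob1 PML workaround, and launching through srun with --mpi=pmi2 instead of mpirun. The scontrol check and the output-file name are my additions and purely illustrative; whether the openib path behaves after upgrading to a newer OMPI is exactly what this thread leaves open, so treat this as a sketch of the workaround, not a fix.

#!/bin/bash
#SBATCH --job-name=nas_bench
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=8
#SBATCH --time=48:00:00
#SBATCH --output=nas.out.tcp

# Record which MPI plugin Slurm uses by default (MpiDefault in slurm.conf)
scontrol show config | grep -i MpiDefault

module use $HOME/easybuild/modules/all/Core
module load GCC/5.4.0-2.26 OpenMPI/1.10.3

# Work around the openib/verbs crash reported above by forcing the TCP BTL
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_pml=ob1

# Launch through Slurm's PMI2 plugin rather than mpirun
srun --mpi=pmi2 ./ep.C.64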