> On Dec 7, 2017, at 3:26 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> Given that ring is working, I don't think that it's a PMI problem.
>
> Can you try running NPB with the tcp btl parameters that I've provided? (I
> assume you have a TCP interconnect; let me know if that's not the case.)
>
> On Thu, 7 Dec 2017 at 12:03, Glenn (Gedaliah) Wolosh <gwol...@njit.edu> wrote:
>> On Dec 7, 2017, at 1:18 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> A couple of things to try to locate the issue:
>>
>> 1. To verify whether PMI is working: have you tried to run something
>> simple, like hello_world
>> (https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c) and ring
>> (https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c)? Please
>> try to run those two and post the results.
>> 2. If hello is working and ring is not, can you try to change the fabric to
>> TCP:
>> $ export OMPI_MCA_btl=tcp,self
>> $ export OMPI_MCA_pml=ob1
>> $ srun ...
>>
>> Please provide the outputs.
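For reference, a minimal sketch of how the two example programs from point 1 above could be built and launched once hello_c.c and ring_c.c have been downloaded from the links; it assumes the loaded OpenMPI/1.10.3 module puts mpicc on the PATH, and the srun geometry and output file names simply mirror the runs shown later in this thread:

# Build the two Open MPI example programs (mpicc from the OpenMPI module is assumed)
mpicc -o hello_c hello_c.c
mpicc -o ring_c ring_c.c
# Launch them the same way the runs below do, capturing the output
srun --mpi=pmi2 --nodes=2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out
srun --mpi=pmi2 --nodes=2 --ntasks-per-node=8 --ntasks=16 ./ring_c > ring_c.out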
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_pml=ob1
srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64

This works:

 NAS Parallel Benchmarks 3.3 -- EP Benchmark

 Number of random numbers generated: 8589934592
 Number of active processes: 64

 EP Benchmark Results:

 CPU Time = 5.9208
 N = 2^ 32
 No. Gaussian Pairs = 3373275903.
 Sums = 4.764367927992081D+04 -8.084072988045549D+04
 Counts:
   0  1572172634.
   1  1501108549.
   2   281805648.
   3    17761221.
   4      424017.
   5        3821.
   6          13.
   7           0.
   8           0.
   9           0.

 EP Benchmark Completed.
 Class           = C
 Size            = 8589934592
 Iterations      = 0
 Time in seconds = 5.92
 Total processes = 64
 Compiled procs  = 64
 Mop/s total     = 1450.82
 Mop/s/process   = 22.67
 Operation type  = Random numbers generated
 Verification    = SUCCESSFUL
 Version         = 3.3.1
 Compile date    = 07 Dec 2017

 Compile options:
    MPIF77     = mpif77
    FLINK      = $(MPIF77)
    FMPI_LIB   = -L/opt/local/easybuild/software/Compiler/GC...
    FMPI_INC   = -I/opt/local/easybuild/software/Compiler/GC...
    FFLAGS     = -O
    FLINKFLAGS = -O
    RAND       = randi8

 Please send feedbacks and/or the results of this run to:

 NPB Development Team
 Internet: n...@nas.nasa.gov

Hmm...

> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out
>
> Hello, world, I am 24 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 0 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 25 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 1 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 27 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 2 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 29 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 31 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 30 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 4 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 5 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 17 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 3 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 7 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 6 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 18 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 22 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 23 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 19 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 9 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 20 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 8 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 10 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 13 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 11 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 26 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 16 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 14 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 28 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 21 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 15 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
> Hello, world, I am 12 of 32, (Open MPI v1.10.3, package: Open MPI gwol...@snode2.p-stheno.tartan.njit.edu Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)
>
> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 --nodes=2 ./ring_c > ring_c.out
>
> Process 1 exiting
> Process 12 exiting
> Process 14 exiting
> Process 13 exiting
> Process 3 exiting
> Process 11 exiting
> Process 5 exiting
> Process 6 exiting
> Process 2 exiting
> Process 4 exiting
> Process 9 exiting
> Process 10 exiting
> Process 7 exiting
> Process 15 exiting
> Process 0 sending 10 to 1, tag 201 (16 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 8 exiting
>
>> 2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>
>>> On Dec 7, 2017, at 12:51 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>
>>> Also, please post the output of
>>> $ srun --mpi=list
>>
>> [gwolosh@p-slogin bin]$ srun --mpi=list
>> srun: MPI types are...
>> srun: mpi/mpich1_shmem
>> srun: mpi/mpich1_p4
>> srun: mpi/lam
>> srun: mpi/openmpi
>> srun: mpi/none
>> srun: mpi/mvapich
>> srun: mpi/pmi2
>> srun: mpi/mpichgm
>>
>>> When the job crashes, are there any error messages in the relevant slurmd.log's,
>>> or output on the screen?
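A quick way to answer the slurmd.log part of that question would be to sweep the compute-node logs directly; the node names below are the ones that appear in the srun errors later in this thread, but the log path is only an assumption (the real location is whatever SlurmdLogFile points to in slurm.conf), and passwordless ssh to the nodes is assumed:

# Pull the most recent error lines from slurmd on each compute node (log path assumed)
for n in snode{1..8}; do
    echo "== $n =="
    ssh "$n" "grep -i error /var/log/slurm/slurmd.log | tail -n 20"
done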
>>
>> On screen:
>>
>> [snode4][[274,1],24][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> [snode4:5175] *** An error occurred in MPI_Bcast
>> [snode4:5175] *** reported by process [17956865,24]
>> [snode4:5175] *** on communicator MPI_COMM_WORLD
>> [snode4:5175] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5175] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [snode4][[274,1],31][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 ***
>> [snode4:5182] *** An error occurred in MPI_Bcast
>> [snode4:5182] *** reported by process [17956865,31]
>> [snode4:5182] *** on communicator MPI_COMM_WORLD
>> [snode4:5182] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5182] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> [snode4][[274,1],27][connect/btl_openib_connect_udcm.c:1448:udcm_wait_for_send_completion] send failed with verbs status 2
>> [snode4:5178] *** An error occurred in MPI_Bcast
>> [snode4:5178] *** reported by process [17956865,27]
>> [snode4:5178] *** on communicator MPI_COMM_WORLD
>> [snode4:5178] *** MPI_ERR_OTHER: known error not in list
>> [snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [snode4:5178] *** and potentially your MPI job)
>> mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)
>> srun: error: snode4: tasks 24,31: Exited with exit code 16
>> srun: error: snode4: tasks 25-30: Killed
>> srun: error: snode5: tasks 32-39: Killed
>> srun: error: snode3: tasks 16-23: Killed
>> srun: error: snode8: tasks 56-63: Killed
>> srun: error: snode7: tasks 48-55: Killed
>> srun: error: snode1: tasks 0-7: Killed
>> srun: error: snode2: tasks 8-15: Killed
>> srun: error: snode6: tasks 40-47: Killed
>>
>> Nothing striking in the slurmd log.
>>
>>> 2017-12-07 9:49 GMT-08:00 Artem Polyakov <artpo...@gmail.com>:
>>> Hello,
>>>
>>> What is the value of the MpiDefault option in your Slurm configuration file?
>>>
>>> 2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <gwol...@njit.edu>:
>>> Hello,
>>>
>>> This is Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).
>>>
>>> [gwolosh@p-slogin bin]$ module li
>>>
>>> Currently Loaded Modules:
>>>   1) GCCcore/.5.4.0 (H)   2) binutils/.2.26 (H)   3) GCC/5.4.0-2.26   4) numactl/2.0.11   5) hwloc/1.11.3   6) OpenMPI/1.10.3
>>>
>>> If I run
>>>
>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
>>>
>>> it runs successfully, but I get a message:
>>>
>>> PMI2 initialized but returned bad values for size/rank/jobid.
>>> This is symptomatic of either a failure to use the
>>> "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
>>> If running under SLURM, try adding "-mpi=pmi2" to your
>>> srun command line. If that doesn't work, or if you are
>>> not running under SLURM, try removing or renaming the
>>> pmi2.h header file so PMI2 support will not automatically
>>> be built, reconfigure and build OMPI, and then try again
>>> with only PMI1 support enabled.
>>>
>>> If I run
>>>
>>> srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64
>>>
>>> the job crashes.
>>>
>>> If I run via sbatch:
>>>
>>> #!/bin/bash
>>> # Job name:
>>> #SBATCH --job-name=nas_bench
>>> #SBATCH --nodes=8
>>> #SBATCH --ntasks=64
>>> #SBATCH --ntasks-per-node=8
>>> #SBATCH --time=48:00:00
>>> #SBATCH --output=nas.out.1
>>> #
>>> ## Command(s) to run (example):
>>> module use $HOME/easybuild/modules/all/Core
>>> module load GCC/5.4.0-2.26 OpenMPI/1.10.3
>>> mpirun -np 64 ./ep.C.64
>>>
>>> the job crashes.
>>>
>>> Using easybuild, these are my config options for ompi:
>>>
>>> configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '
>>> configopts += '--enable-mpirun-prefix-by-default '  # suppress failure modes in relation to mpirun path
>>> configopts += '--with-hwloc=$EBROOTHWLOC '          # hwloc support
>>> configopts += '--disable-dlopen '                   # statically link components, don't do dynamic loading
>>> configopts += '--with-slurm --with-pmi '
>>>
>>> And finally:
>>>
>>> $ ldd /opt/local/easybuild/software/Compiler/GCC/5.4.0-2.26/OpenMPI/1.10.3/bin/orterun | grep pmi
>>>         libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)
>>>         libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)
>>>
>>> $ ompi_info | grep pmi
>>>       MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
>>>      MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
>>>  MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>   MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
>>>
>>> Any suggestions?
>>>
>>> _______________
>>> Gedaliah Wolosh
>>> IST Academic and Research Computing Systems (ARCS)
>>> NJIT
>>> GITC 2203
>>> 973 596 5437
>>> gwol...@njit.edu
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>
> --
> ----- Best regards, Artem Polyakov (Mobile mail)
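On the MpiDefault question raised above, a minimal sketch of how the current default could be checked and what the slurm.conf line looks like; the pmi2 value is shown purely as an illustration of making PMI2 the default (so a plain srun behaves like srun --mpi=pmi2), not as this cluster's actual setting:

# Query the running Slurm configuration for the current default MPI plugin
scontrol show config | grep -i MpiDefault

# Illustrative slurm.conf line; changing it requires reconfiguring the Slurm daemons
MpiDefault=pmi2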