Whoops, this is actually Cori-KNL.

On Wed, Apr 15, 2020 at 4:33 PM Mark Adams <mfad...@lbl.gov> wrote:
> We have a problem when going from 32K to 64K cores on Cori-haswell.
> Does anyone have any thoughts?
> Thanks,
> Mark
>
> ---------- Forwarded message ---------
> From: David Trebotich <dptrebot...@lbl.gov>
> Date: Wed, Apr 15, 2020 at 4:20 PM
> Subject: Re: petsc on Cori Haswell
> To: Mark Adams <mfad...@lbl.gov>
>
>
> Hey Mark-
> I am running into some issues that I am convinced are from the PETSc
> build. I am able to build and run on up to 32K cores. At 64K I start
> getting stuff like below (looks like two issues: pmi stuff and MPI_Init). I
> have been working with Brian Freisen to see if it's a NERSC problem. At
> this point I build without PETSc and then run native gmg in Chombo and have
> no problems. The problems only come with building with PETSc, and at larger
> concurrencies. The only thing that has changed is that this is a new PETSc
> installation. Perhaps something changed in the PETSc version you built from
> previously? Thanks for the help.
> Treb
>
> Mon Apr 13 17:49:45 2020: [PE_101955]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_101958]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_101958]:_pmi_init:_pmi_mmap_init returned -1
> Mon Apr 13 17:49:45 2020: [PE_101979]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_101979]:_pmi_init:_pmi_mmap_init returned -1
> Mon Apr 13 17:49:45 2020: [PE_82712]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=28, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_17868]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_97918]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_17869]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_17869]:_pmi_init:_pmi_mmap_init returned -1
> Mon Apr 13 17:49:45 2020: [PE_110562]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=27, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_110562]:_pmi_init:_pmi_mmap_init returned -1
> Mon Apr 13 17:49:45 2020: [PE_110563]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=27, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_27899]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=38, pes_this_node=64, timeout=180 secs
> [Mon Apr 13 17:49:45 2020] [c7-4c1s6n0] Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(246).......: channel initialization failed
> MPID_Init(647).......: PMI2 init failed: 1
> Attempting to use an MPI routine before initializing MPICH
> [Mon Apr 13 17:49:45 2020] [c7-4c1s6n0] Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(246).......: channel initialization failed
> MPID_Init(647).......: PMI2 init failed: 1
> Attempting to use an MPI routine before initializing MPICH
> Mon Apr 13 17:49:45 2020: [PE_71961]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_71961]:_pmi_init:_pmi_mmap_init returned -1
> Mon Apr 13 17:49:45 2020: [PE_71962]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_64329]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_64335]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_64335]:_pmi_init:_pmi_mmap_init returned -1
> [Mon Apr 13 17:49:45 2020] [c6-1c2s5n2] Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(246).......: channel initialization failed
> MPID_Init(647).......: PMI2 init failed: 1
> Attempting to use an MPI routine before initializing MPICH
> [Mon Apr 13 17:49:45 2020] [c9-4c2s13n2] Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(246).......: channel initialization failed
> MPID_Init(647).......: PMI2 init failed: 1
> Attempting to use an MPI routine before initializing MPICH
> Mon Apr 13 17:49:45 2020: [PE_71960]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
> Mon Apr 13 17:49:45 2020: [PE_71960]:_pmi_init:_pmi_mmap_init returned -1
> [Mon Apr 13 17:49:45 2020] [c6-3c2s9n1] Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(246).......: channel initialization failed