Barry,

If you look at the graphs I generated (on my Mac), you will see that OpenMPI and MPICH have very different values, and that MPICH does not seem to adhere to the standard for releasing MPI_Isend resources following an MPI_Wait.

-sanjay

PS: I agree with Barry's assessment; this is really not acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
   Junchao,

   This is insane. Either the OpenMPI library or something in the OS 
underneath related to sockets and interprocess communication is grabbing 
additional space for each round of MPI communication! Does MPICH show the same 
values as OpenMPI, or different ones? When you run on Linux, do you get the 
same values as on Apple, or different? Same values would suggest the issue is 
inside OpenMPI/MPICH; different values would suggest the problem is more likely 
at the OS level. Does this happen only with the default VecScatter that uses 
blocking MPI? What happens with PetscSF under Vec? Is it somehow related to 
PETSc's use of nonblocking sends and receives? One could presumably use 
valgrind to see exactly what lines in what code are causing these increases. I 
don't think we can just shrug and say this is the way it is; we need to track 
down and understand the cause (and if possible fix it).
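For instance (just a guess at the invocation, untested; massif's 
--pages-as-heap=yes option records page-level growth such as mmap, not only 
malloc), something like

   mpiexec -n 4 valgrind --tool=massif --pages-as-heap=yes ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500
   ms_print massif.out.<pid>

might show which allocations account for the growing RSS.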

   Barry


On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:

Sanjay,
I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted 
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end 
of KSPSolve, computed the deltas, and summed them over processes. Then I 
tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c.
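The measurement is roughly the sketch below (not my exact patch; it is shown here 
as a wrapper around KSPSolve rather than as edits inside KSPSolve itself, and the 
wrapper name KSPSolveWithMemoryDelta is invented):

#include <petscksp.h>

/* Sketch only: report the change in RSS and in PETSc-malloc'd space across a
   KSPSolve, summed over all ranks. */
PetscErrorCode KSPSolveWithMemoryDelta(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss0, rss1, mal0, mal1, delta[2], total[2];

  ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);  /* resident set size, bytes */
  ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);  /* PETSc-malloc'd space, bytes */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
  delta[0] = rss1 - rss0;
  delta[1] = mal1 - mal0;
  ierr = MPI_Allreduce(delta, total, 2, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%10g, Malloc Delta=%10g\n", total[0], total[1]);CHKERRQ(ierr);
  return 0;
}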
With OpenMPI,
mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 
> 128.log
grep -n -v "RSS Delta=         0, Malloc Delta=         0" 128.log
1:RSS Delta=     69632, Malloc Delta=         0
2:RSS Delta=     69632, Malloc Delta=         0
3:RSS Delta=     69632, Malloc Delta=         0
4:RSS Delta=     69632, Malloc Delta=         0
9:RSS Delta=9.25286e+06, Malloc Delta=         0
22:RSS Delta=     49152, Malloc Delta=         0
44:RSS Delta=     20480, Malloc Delta=         0
53:RSS Delta=     49152, Malloc Delta=         0
66:RSS Delta=      4096, Malloc Delta=         0
97:RSS Delta=     16384, Malloc Delta=         0
119:RSS Delta=     20480, Malloc Delta=         0
141:RSS Delta=     53248, Malloc Delta=         0
176:RSS Delta=     16384, Malloc Delta=         0
308:RSS Delta=     16384, Malloc Delta=         0
352:RSS Delta=     16384, Malloc Delta=         0
550:RSS Delta=     16384, Malloc Delta=         0
572:RSS Delta=     16384, Malloc Delta=         0
669:RSS Delta=     40960, Malloc Delta=         0
924:RSS Delta=     32768, Malloc Delta=         0
1694:RSS Delta=     20480, Malloc Delta=         0
2099:RSS Delta=     16384, Malloc Delta=         0
2244:RSS Delta=     20480, Malloc Delta=         0
3001:RSS Delta=     16384, Malloc Delta=         0
5883:RSS Delta=     16384, Malloc Delta=         0

If I increase the grid:
mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 
-malloc_test >512.log
grep -n -v "RSS Delta=         0, Malloc Delta=         0" 512.log
1:RSS Delta=1.05267e+06, Malloc Delta=         0
2:RSS Delta=1.05267e+06, Malloc Delta=         0
3:RSS Delta=1.05267e+06, Malloc Delta=         0
4:RSS Delta=1.05267e+06, Malloc Delta=         0
13:RSS Delta=1.24932e+08, Malloc Delta=         0

So we did see the RSS increase in 4k-page-sized chunks after KSPSolve. As long as 
there are no memory leaks, why do you care about it? Is it because you run out of memory?

On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:

    Thanks for the update. So the current conclusions are that using the 
Waitall in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve (despite the 
Waitall fix)?

If you literally just comment out the call to KSPSolve(), is there then no growth 
in memory usage with OpenMPI?


Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH 
and/or MPI resources not being freed in KSPSolve().

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


   Barry



On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s...@berkeley.edu> wrote:

Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my 
knowledge of MPI is pretty minimal, and the origins of the routine are with a 
programmer we hired a decade+ back from NERSC.  I'll have to look into 
VecScatter.  It will be great to dispense with our roll-your-own routines (we 
even have our own reduceALL scattered around the code).

Interestingly, the MPI_Waitall has solved the problem when using OpenMPI, but it 
still persists with MPICH.  Graphs attached.
I'm going to run with OpenMPI for now (but I guess I really still need to 
figure out what is wrong with MPICH and Waitall; I'll try Barry's suggestion of 
--download-mpich-configure-arguments="--enable-error-messages=all --enable-g" 
later today and report back).

Regarding MPI_Barrier, it was put in due to a problem where some processes were 
finishing up sending and receiving and exiting the subroutine before the 
receiving processes had completed (which resulted in data loss, as the buffers 
are freed after the call to the routine). MPI_Barrier was the solution proposed 
to us.  I don't think I can dispense with it, but I will think about it some more.

I'm not so sure about using MPI_Irecv, as it will require a bit of rewriting 
since right now I process the received data sequentially after each blocking 
MPI_Recv -- clearly slower but easier to code.
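One way to keep that message-at-a-time processing while still using nonblocking 
receives might be MPI_Waitany (not something suggested in this thread; the buffer 
and neighbor-list names below are invented), roughly, in C:

#include <mpi.h>
#include <stdlib.h>

/* Sketch only: post all receives up front, then handle each message as it
   completes, preserving one-message-at-a-time processing. */
void receive_and_process(double **rbuf, const int *rcount, const int *rsrc,
                         int nrecv, int tag, MPI_Comm comm)
{
  MPI_Request *reqs = (MPI_Request *) malloc(nrecv * sizeof(MPI_Request));
  int j, idx;

  for (j = 0; j < nrecv; j++)
    MPI_Irecv(rbuf[j], rcount[j], MPI_DOUBLE, rsrc[j], tag, comm, &reqs[j]);

  for (j = 0; j < nrecv; j++) {
    MPI_Waitany(nrecv, reqs, &idx, MPI_STATUS_IGNORE);
    /* process rbuf[idx] here, just as the data is processed now after each
       blocking MPI_Recv, only in completion order */
  }
  free(reqs);
}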

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
Hi Sanjay,

On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
<petsc-users@mcs.anl.gov> wrote:

The problem seems to persist but with a different signature.  Graphs attached 
as before.

Totals with MPICH (NB: single run)

For the CG/Jacobi          data_exchange_total = 41,385,984; kspsolve_total = 38,289,408
For the GMRES/BJACOBI      data_exchange_total = 41,324,544; kspsolve_total = 41,324,544

Just reading the MPI docs, I am wondering if I need some sort of 
MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
I would have thought that with the blocking receives and the MPI_Barrier 
everything would have fully completed and been cleaned up before
all processes exited the routine, but perhaps I am wrong on that.

Skimming the Fortran code you sent, you do:

for i in ...:
    call MPI_Isend(..., req, ierr)

for i in ...:
    call MPI_Recv(..., ierr)

But you never call MPI_Wait on the request you got back from the Isend. So the 
MPI library will never free the data structures it created.

The usual pattern for these non-blocking communications is to allocate an array 
for the requests of length nsend+nrecv and then do:

for i in nsend:
    call MPI_Isend(..., req[i], ierr)
for j in nrecv:
    call MPI_Irecv(..., req[nsend+j], ierr)

call MPI_Waitall(req, ..., ierr)
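
A self-contained C rendering of that pattern (array names invented, error 
checking omitted) would be something like:

#include <mpi.h>
#include <stdlib.h>

/* Sketch of the nsend+nrecv request pattern: post all sends and receives,
   then wait on everything so the MPI library can release its request state. */
void exchange(double **sbuf, const int *scount, const int *sdest, int nsend,
              double **rbuf, const int *rcount, const int *rsrc,  int nrecv,
              int tag, MPI_Comm comm)
{
  MPI_Request *reqs = (MPI_Request *) malloc((nsend + nrecv) * sizeof(MPI_Request));
  int i, j;

  for (i = 0; i < nsend; i++)
    MPI_Isend(sbuf[i], scount[i], MPI_DOUBLE, sdest[i], tag, comm, &reqs[i]);
  for (j = 0; j < nrecv; j++)
    MPI_Irecv(rbuf[j], rcount[j], MPI_DOUBLE, rsrc[j], tag, comm, &reqs[nsend + j]);

  /* completes (and frees) every request, sends and receives alike */
  MPI_Waitall(nsend + nrecv, reqs, MPI_STATUSES_IGNORE);
  free(reqs);
}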

I note also that there's no need for the Barrier at the end of the routine: this 
kind of communication does neighbourwise synchronisation, so there is no need to 
add (unnecessary) global synchronisation on top.

As an aside, is there a reason you don't use PETSc's VecScatter to manage this 
global to local exchange?
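
For reference, the VecScatter route is roughly the sketch below (the index set 
construction is application-specific and elided; the function name is invented):

#include <petscvec.h>

/* Sketch only: gather the needed global entries into a local work vector. */
PetscErrorCode global_to_local(Vec xglobal, IS needed, Vec xlocal, VecScatter *ctx)
{
  PetscErrorCode ierr;
  ierr = VecScatterCreate(xglobal, needed, xlocal, NULL, ctx);CHKERRQ(ierr);
  ierr = VecScatterBegin(*ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(*ctx, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  return 0;
}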

Cheers,

Lawrence
[Attachments: cg_mpichwall.png, cg_wall.png, gmres_mpichwall.png, gmres_wall.png]
