On Mon, Jul 20, 2020 at 2:26 PM Chris Hewson <ch...@resfrac.com> wrote:
> Chris is using Haswell, what MPI are you using? I trust you are not using
> Moose.
>
> - yes, using haswell, mpi is mpich v3.3a2 on ubuntu 18.04. I am not using
> MOOSE.

Do not use mpich v3.3a2, which is an alpha version released in 2016. Use
the current stable release, mpich-3.3.2.

> *Chris Hewson*
> Senior Reservoir Simulation Engineer
> ResFrac
> +1.587.575.9792
>
> On Mon, Jul 20, 2020 at 1:14 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> This is indeed a nasty bug, but having two separate cases should be
>> useful.
>>
>> Chris is using Haswell, what MPI are you using? I trust you are not
>> using Moose.
>>
>> Fande, what machine/MPI are you using?
>>
>> On Mon, Jul 20, 2020 at 3:04 PM Chris Hewson <ch...@resfrac.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Chris: It sounds like you just have one matrix that you give to MUMPS.
>>> You seem to be creating a matrix in the middle of your run. Are you
>>> doing dynamic adaptivity?
>>>
>>> - I have 2 separate matrices I give to MUMPS, but as this is happening
>>> in the production build of my code, I can't determine with certainty
>>> which call to MUMPS, KSPBCGS, or UMFPACK it is happening in.
>>>
>>> I do destroy and recreate matrices in the middle of my runs, but this
>>> happens multiple times before the fault occurs, and (presumably) in the
>>> same way each time. I also run checks on the matrix sizes and on
>>> everything I send to PETSc, and those all pass; at some point there is
>>> a size mismatch somewhere, so understandably this is not a lot to go
>>> on. I am not doing dynamic adaptivity; the mesh is instead changing
>>> its size.
>>>
>>> And I agree with Fande: the most frustrating part is that it's not
>>> reproducible. That said, I'm not 100% sure that the problem lies within
>>> the PETSc code base either.
>>>
>>> Current working theories are:
>>> 1. Some sort of MPI problem with the sending of one of the matrix
>>> elements (using mpich version 3.3a2)
>>> 2. Some of the memory behind static pointers gets corrupted, although
>>> then I would expect a garbage number rather than something that could
>>> plausibly make sense.
>>>
>>> *Chris Hewson*
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>>
>>> On Mon, Jul 20, 2020 at 12:41 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> On Mon, Jul 20, 2020 at 2:36 PM Fande Kong <fdkong...@gmail.com> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Just to be clear, I do not think it is related to GAMG or PtAP. It is
>>>>> a communication issue:
>>>>
>>>> Your stack trace was from PtAP, but Chris's problem is not.
>>>>
>>>>> Reran the same code, and I just got:
>>>>>
>>>>> [252]PETSC ERROR: --------------------- Error Message
>>>>> --------------------------------------------------------------
>>>>> [252]PETSC ERROR: Petsc has generated inconsistent data
>>>>> [252]PETSC ERROR: Received vector entry 4469094877509280860 out of
>>>>> local range [255426072,256718616)
>>>>
>>>> OK, now this (4469094877509280860) is clearly garbage. That is the
>>>> important thing. I have to think your MPI is buggy.
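
A quick way to rule out a stale or mixed install is to confirm which MPI
library the failing binary actually runs against: MPICH installs an
mpichversion utility, and any MPI-3 implementation can report itself at
runtime. A minimal sketch (mpiver.c is just an illustrative name):

/* mpiver.c - print the MPI library this binary is linked against.
   Build: mpicc mpiver.c -o mpiver
   Run:   mpiexec -n 1 ./mpiver */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  char version[MPI_MAX_LIBRARY_VERSION_STRING];
  int  len;

  MPI_Init(&argc, &argv);
  MPI_Get_library_version(version, &len);  /* available since MPI-3.0 */
  printf("%s\n", version);
  MPI_Finalize();
  return 0;
}

If this still prints 3.3a2 after upgrading, the old library is being
picked up at link or run time.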
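
If a buggy MPI is the suspicion, a standalone send/receive integrity test
is a cheap first experiment. The sketch below pairs up ranks and verifies
a known pattern after each exchange. It is crude: intermittent corruption
may need long runs, large messages, or the application's exact
communication pattern to show up, so a clean result here is suggestive,
not conclusive.

/* mpicheck.c - crude message-integrity check between rank pairs.
   Build: mpicc mpicheck.c -o mpicheck
   Run:   mpiexec -n 4 ./mpicheck */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)  /* entries per message (8 MB of long long) */

int main(int argc, char **argv)
{
  int        rank, size, partner, i, iter, bad = 0;
  long long *sendbuf, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  partner = rank ^ 1;  /* pair ranks 0-1, 2-3, ...; with an odd rank
                          count the last rank sits idle */

  sendbuf = (long long *)malloc(N * sizeof(long long));
  recvbuf = (long long *)malloc(N * sizeof(long long));

  for (iter = 0; iter < 100 && partner < size; iter++) {
    /* fill with a pattern the partner can predict and check */
    for (i = 0; i < N; i++) sendbuf[i] = (long long)rank * N + i + iter;
    MPI_Sendrecv(sendbuf, N, MPI_LONG_LONG, partner, 0,
                 recvbuf, N, MPI_LONG_LONG, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (i = 0; i < N; i++) {
      if (recvbuf[i] != (long long)partner * N + i + iter) bad++;
    }
  }
  printf("[%d] corrupted entries: %d\n", rank, bad);
  free(sendbuf);
  free(recvbuf);
  MPI_Finalize();
  return 0;
}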
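
On the PETSc side, the "Received vector entry ... out of local range"
check fires on the owning rank, after the value has already crossed the
wire. A sender-side guard can narrow down where the index goes bad: if
every global index is validated before VecSetValues and that check never
trips, yet the owning rank still receives garbage, the corruption
happened in transit, which points at the MPI layer (or at memory being
stomped underneath it). A hedged sketch; SafeVecSetValues is an
illustrative wrapper, not a PETSc routine, and it hard-codes ADD_VALUES
for brevity:

/* Validate global indices before handing them to VecSetValues, so bad
   indices are caught on the sending rank rather than on the owner. */
#include <petscvec.h>

static PetscErrorCode SafeVecSetValues(Vec v, PetscInt n,
                                       const PetscInt idx[],
                                       const PetscScalar vals[])
{
  PetscInt       N, i;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = VecGetSize(v, &N);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {
    if (idx[i] < 0 || idx[i] >= N) {
      SETERRQ2(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
               "Index %D out of global range [0,%D)", idx[i], N);
    }
  }
  ierr = VecSetValues(v, n, idx, vals, ADD_VALUES);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}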
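
And for the destroy-and-recreate cycle Chris describes, the usual pattern
when the mesh (and hence the global size) changes is to build a fresh Mat
rather than resize the old one. A sketch of that cycle (RebuildMatrix is
an illustrative name; production code would also set preallocation before
assembly):

#include <petscmat.h>

/* Destroy the old matrix, if any, and create an empty AIJ matrix with
   new local sizes; the global sizes are derived from the local ones. */
PetscErrorCode RebuildMatrix(Mat *A, PetscInt mLocal, PetscInt nLocal)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatDestroy(A);CHKERRQ(ierr);  /* a NULL *A is a no-op */
  ierr = MatCreate(PETSC_COMM_WORLD, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, mLocal, nLocal, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetType(*A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSetUp(*A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}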