Yes, I did. The memory check for rss computes the memory footprint of the column indices using sizeof(unsigned long long) instead of sizeof(int).
For Junchao: I wonder whether keeping track of which loaded columns are owned by the current process, and which are not, also needs some memory storage. Just a wild thought.

On Thu, Jul 2, 2020 at 11:40 PM Ernesto Prudencio <epruden...@slb.com> wrote:

> Karl,
>
> Are you taking into account that every "integer" index might be 64 bits instead of 32 bits, depending on the PETSc configuration / compilation choices for PetscInt?
>
> Ernesto.
>
> From: petsc-users [mailto:petsc-users-boun...@mcs.anl.gov] On Behalf Of Junchao Zhang
> Sent: Thursday, July 2, 2020 11:21 PM
> To: Karl Lin <karl.lin...@gmail.com>
> Cc: PETSc users list <petsc-users@mcs.anl.gov>
> Subject: [Ext] Re: [petsc-users] matcreate and assembly issue
>
> Is it because indices for the nonzeros also need memory?
>
> --Junchao Zhang
>
> On Thu, Jul 2, 2020 at 10:04 PM Karl Lin <karl.lin...@gmail.com> wrote:
>
> Hi, Matthew,
>
> Thanks for the reply. However, I don't really get why additional mallocs would double the memory footprint. If I know only 1 GB of matrix is being loaded, 2 GB of memory should not be occupied even if PETSc needs to allocate more space.
>
> Regards,
> Karl
>
> On Thu, Jul 2, 2020 at 8:10 PM Matthew Knepley <knep...@gmail.com> wrote:
>
> On Thu, Jul 2, 2020 at 7:30 PM Karl Lin <karl.lin...@gmail.com> wrote:
>
> Hi, Matt,
>
> Thanks for the tip last time. We just encountered another issue with large data sets; this time the behavior is the opposite of last time. The data is 13.5 TB, and the total number of matrix columns is 2.4 billion. Our program crashed during matrix loading due to memory overflow on one node. As mentioned before, we have a small memory check during matrix loading to keep track of rss.
> The printout of rss in the log shows a normal increase on many nodes, i.e., if we load a 1 GB portion of the matrix, rss increases by roughly 1 GB after MatSetValues for that portion. On the node that overflowed, rss increased by 2 GB after only 1 GB of the matrix had been loaded through MatSetValues. We are very puzzled by this. What could make the memory footprint twice as large as needed? Thanks in advance for any insight.

Matthew Knepley replied:

> The only way I can imagine this happening is that you have not preallocated correctly, so that some values are causing additional mallocs.
>
> Thanks,
> Matt

> Regards,
> Karl
>
> On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knep...@gmail.com> wrote:
>
> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com> wrote:
>
> Hi, Matthew,
>
> Thanks for the suggestion. I just did another run, and here are some detailed stack traces; maybe they will provide some more insight:
>
> *** Process received signal ***
> Signal: Aborted (6)
> Signal code: (-6)
> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
> [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
> [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
> [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
> [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
> [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
> [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
> [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
> [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
> [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
>
> By reconfiguring, you mean recompiling petsc with that option, correct?

Matthew Knepley replied:

> Reconfiguring.
>
> Thanks,
> Matt

> Thank you.
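Matt's preallocation diagnosis can be sketched as follows. This is a hedged, minimal example in the PETSc 3.10-era error-checking style; the local sizes and per-row counts are placeholders (a real loader would count nonzeros per row before inserting), not values from this thread:

```c
#include <petscmat.h>

int main(int argc, char **argv)
{
    Mat            A;
    PetscErrorCode ierr;
    PetscInt       mLocal = 1000, nLocal = 1000;   /* placeholder local sizes */

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
    ierr = MatSetSizes(A, mLocal, nLocal, PETSC_DETERMINE, PETSC_DETERMINE); CHKERRQ(ierr);
    ierr = MatSetType(A, MATMPIAIJ); CHKERRQ(ierr);

    /* Per-row upper bounds: 30 nonzeros in the diagonal block, 10 in the
       off-diagonal block (placeholders). Passing exact per-row arrays
       instead of NULL gives tighter bounds and no wasted memory. */
    ierr = MatMPIAIJSetPreallocation(A, 30, NULL, 10, NULL); CHKERRQ(ierr);

    /* Turn any nonzero that falls outside the preallocation into an error
       instead of a silent (and costly) malloc. */
    ierr = MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE); CHKERRQ(ierr);

    /* ... MatSetValues() loop, then MatAssemblyBegin/End ... */

    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
}
```

When preallocation is too small, each overflow triggers an allocate-and-copy of row storage, which transiently holds both the old and new buffers, one plausible way for rss to run well ahead of the data actually loaded.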
> Karl
>
> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knep...@gmail.com> wrote:
>
> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.lin...@gmail.com> wrote:
>
> Hi there,
>
> We have written a program that uses PETSc to solve large sparse matrix systems. It has been working fine for a while. Recently we encountered a problem when the size of the sparse matrix is larger than 10 TB. We used several hundred nodes and 2200 processes. The program always crashes during MatAssemblyBegin. Upon a closer look, there seems to be something unusual. We have a small memory check during matrix loading to keep track of rss. The printout of rss in the log shows a normal increase up to rank 2160, i.e., if we load a 1 GB portion of the matrix, rss increases by roughly that amount after MatSetValues for that portion. From rank 2161 onwards, rss does not increase in any rank after the matrix is loaded. Then comes MatAssemblyBegin, and the program crashes on rank 2160. Is there an upper limit on the number of processes PETSc can handle, or an upper limit on the size of the matrix PETSc can handle? Thank you very much for any info.

Matthew Knepley replied:

> It sounds like you overflowed int somewhere. We try to check for this, but catching every place is hard. Try reconfiguring with --with-64-bit-indices.
>
> Thanks,
> Matt

> Regards,
> Karl

> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
> https://www.cse.buffalo.edu/~knepley/