Hi Matthew,

Thanks for the reply. However, I don't really understand why additional mallocs would double the memory footprint. If I know only 1GB of matrix data is being loaded, there shouldn't be 2GB of memory occupied even if PETSc needs to allocate more space.
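For what it's worth, below is a minimal sketch of how we understand preallocation and the malloc check are supposed to work. The sizes and nonzero counts are placeholders, not our actual loader, and PetscMemoryGetCurrentUsage reports the same rss figure we log:

  #include <petscmat.h>

  /* Sketch: preallocate an MPIAIJ matrix with exact per-row counts, then
     check whether MatSetValues triggered additional mallocs. All sizes
     and the sparsity pattern are placeholders. */
  int main(int argc, char **argv)
  {
    Mat            A;
    MatInfo        info;
    PetscLogDouble rss;
    PetscInt       m = 1000, N = 1000000; /* placeholder local rows, global cols */
    PetscInt      *d_nnz, *o_nnz;         /* per-row counts: diagonal / off-diagonal blocks */
    PetscInt       i;

    PetscInitialize(&argc, &argv, NULL, NULL);
    PetscMalloc2(m, &d_nnz, m, &o_nnz);
    for (i = 0; i < m; i++) { d_nnz[i] = 5; o_nnz[i] = 5; } /* assumed pattern */

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, m, PETSC_DECIDE, PETSC_DETERMINE, N);
    MatSetType(A, MATMPIAIJ);
    /* Exact per-row counts; counts that are too small force reallocations
       during MatSetValues, each of which copies old arrays into larger ones */
    MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz);

    /* ... MatSetValues calls would go here ... */

    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* info.mallocs should be 0 if the preallocation matched what was inserted */
    MatGetInfo(A, MAT_LOCAL, &info);
    PetscMemoryGetCurrentUsage(&rss); /* resident set size of this process, in bytes */
    PetscPrintf(PETSC_COMM_SELF, "mallocs %g nz_allocated %g nz_used %g rss %g\n",
                (double)info.mallocs, (double)info.nz_allocated,
                (double)info.nz_used, (double)rss);

    PetscFree2(d_nnz, o_nnz);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }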
Regards,

Karl

On Thu, Jul 2, 2020 at 8:10 PM Matthew Knepley <knep...@gmail.com> wrote:

> On Thu, Jul 2, 2020 at 7:30 PM Karl Lin <karl.lin...@gmail.com> wrote:
>
>> Hi, Matt
>>
>> Thanks for the tip last time. We just encountered another issue with
>> large data sets. This time the behavior is the opposite of last time.
>> The data is 13.5TB, and the total number of matrix columns is 2.4
>> billion. Our program crashed during matrix loading due to memory
>> overflow on one node. As mentioned before, we have a little memory
>> check while loading the matrix to keep track of rss. The printout of
>> rss in the log shows a normal increase on many nodes, i.e., if we load
>> in a portion of the matrix that is 1GB, rss increases by roughly 1GB
>> after MatSetValues for that portion. On the node that overflowed,
>> however, rss increased by 2GB after only 1GB of the matrix was loaded
>> through MatSetValues. We are very puzzled by this. What could make the
>> memory footprint twice as large as needed? Thanks in advance for any
>> insight.
>>
>
> The only way I can imagine this happening is that you have not
> preallocated correctly, so that some values are causing additional
> mallocs.
>
> Thanks,
>
> Matt
>
>> Regards,
>>
>> Karl
>>
>> On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knep...@gmail.com> wrote:
>>
>>> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com> wrote:
>>>
>>>> Hi, Matthew
>>>>
>>>> Thanks for the suggestion. I just did another run, and here are some
>>>> detailed stack traces that may provide more insight:
>>>>
>>>> *** Process received signal ***
>>>> Signal: Aborted (6)
>>>> Signal code: (-6)
>>>> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
>>>> [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
>>>> [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
>>>> [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
>>>> [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
>>>> [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
>>>> [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
>>>> [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
>>>> [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
>>>> [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
>>>>
>>>> By reconfiguring, you mean recompiling PETSc with that option, correct?
>>>>
>>>
>>> Reconfiguring.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>> Thank you.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knep...@gmail.com> wrote:
>>>>
>>>>> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.lin...@gmail.com> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> We have written a program that uses PETSc to solve large sparse
>>>>>> matrix systems. It has been working fine for a while. Recently we
>>>>>> encountered a problem when the size of the sparse matrix is larger
>>>>>> than 10TB. We used several hundred nodes and 2200 processes. The
>>>>>> program always crashes during MatAssemblyBegin. Upon a closer look,
>>>>>> there seems to be something unusual. We have a little memory check
>>>>>> while loading the matrix to keep track of rss. The printout of rss
>>>>>> in the log shows a normal increase up to rank 2160, i.e., if we
>>>>>> load in a portion of the matrix that is 1GB, rss increases by
>>>>>> roughly that amount after MatSetValues for that portion. From rank
>>>>>> 2161 onwards, rss does not increase in any rank after the matrix is
>>>>>> loaded. Then comes MatAssemblyBegin, and the program crashed on
>>>>>> rank 2160.
>>>>>>
>>>>>> Is there an upper limit on the number of processes PETSc can
>>>>>> handle? Or an upper limit on the size of the matrix PETSc can
>>>>>> handle? Thank you very much for any info.
>>>>>>
>>>>>
>>>>> It sounds like you overflowed int somewhere. We try to check for
>>>>> this, but catching every place is hard. Try reconfiguring with
>>>>>
>>>>> --with-64-bit-indices
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Karl
>>>>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
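As a concrete illustration of the overflow suggested above: by default PetscInt is 32-bit, and its maximum value, 2^31 - 1 = 2,147,483,647, is smaller than the 2.4 billion columns in this matrix, so global column indices cannot be represented. A minimal sketch of a runtime guard (the column count is taken from the thread; the check itself is illustrative):

  #include <petscsys.h>

  /* Sketch: verify that the global column count fits in PetscInt.
     Configuring PETSc with --with-64-bit-indices makes PetscInt 64 bits. */
  int main(int argc, char **argv)
  {
    PetscInt64 ncols = 2400000000LL; /* ~2.4 billion columns, as in the thread */

    PetscInitialize(&argc, &argv, NULL, NULL);
  #if defined(PETSC_USE_64BIT_INDICES)
    PetscPrintf(PETSC_COMM_WORLD, "64-bit PetscInt: %lld columns fit\n",
                (long long)ncols);
  #else
    if (ncols > (PetscInt64)PETSC_MAX_INT)
      PetscPrintf(PETSC_COMM_WORLD, "32-bit PetscInt overflows at %lld columns; "
                  "reconfigure with --with-64-bit-indices\n", (long long)ncols);
  #endif
    PetscFinalize();
    return 0;
  }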