Is it because indices for the nonzeros also need memory?

--Junchao Zhang
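To put a rough number on that question: in a compressed sparse row layout (the general scheme behind the AIJ format), every stored nonzero carries a column index next to its value, and there is a row-pointer array on top. The sketch below is only back-of-the-envelope arithmetic in plain Python, not PETSc code. It assumes 8-byte real scalars, assumes the "1GB" in the report counts the values alone, and uses a made-up local row count; `csr_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3


def csr_bytes(nnz, nrows, scalar_bytes=8, index_bytes=4):
    """Estimate bytes needed to hold a CSR matrix (hypothetical helper)."""
    values = nnz * scalar_bytes               # one scalar per nonzero
    col_indices = nnz * index_bytes           # one column index per nonzero
    row_pointers = (nrows + 1) * index_bytes  # start offset of each row
    return values + col_indices + row_pointers


# Assumption: the values alone occupy 1 GiB at 8 bytes per value.
nnz = GIB // 8
# Assumption: an arbitrary local row count, chosen only for illustration.
nrows = 1_000_000

for index_bytes in (4, 8):
    total = csr_bytes(nnz, nrows, index_bytes=index_bytes)
    print(f"{index_bytes}-byte indices: {total / GIB:.2f} GiB")
```

Under these assumptions, 4-byte indices add about half as much memory again, and 8-byte indices (as in a build configured with --with-64-bit-indices) roughly double it. Whether this explains the rss on the one node is not something the thread establishes, since the other nodes reportedly grew by only about 1GB.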
On Thu, Jul 2, 2020 at 10:04 PM Karl Lin <karl.lin...@gmail.com> wrote:

> Hi, Matthew
>
> Thanks for the reply. However, I don't really get why additional malloc
> would double the memory footprint. If I know there is only 1GB matrix being
> loaded, there shouldn't be 2GB memory occupied even if Petsc needs to
> allocate more space.
>
> regards,
>
> Karl
>
> On Thu, Jul 2, 2020 at 8:10 PM Matthew Knepley <knep...@gmail.com> wrote:
>
>> On Thu, Jul 2, 2020 at 7:30 PM Karl Lin <karl.lin...@gmail.com> wrote:
>>
>>> Hi, Matt
>>>
>>> Thanks for the tip last time. We just encountered another issue with
>>> large data sets. This time the behavior is the opposite from last time. The
>>> data is 13.5TB, the total number of matrix columns is 2.4 billion. Our
>>> program crashed during matrix loading due to memory overflow in one node.
>>> As said before, we have a little memory check during loading the matrix to
>>> keep track of rss. The printout of rss in the log shows normal increase in
>>> many nodes, i.e., if we load in a portion of the matrix that is 1GB, after
>>> MatSetValues for that portion, rss will increase roughly about 1GB. On the
>>> node that has memory overflow, the rss increased by 2GB after only 1GB of
>>> matrix is loaded through MatSetValues. We are very puzzled by this. What
>>> could make the memory footprint twice as much as needed? Thanks in advance
>>> for any insight.
>>>
>>
>> The only way I can imagine this happening is that you have not
>> preallocated correctly, so that some values are causing additional mallocs.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Regards,
>>>
>>> Karl
>>>
>>> On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knep...@gmail.com>
>>> wrote:
>>>
>>>> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Matthew
>>>>>
>>>>> Thanks for the suggestion, just did another run and here are some
>>>>> detailed stack traces, maybe will provide some more insight:
>>>>> *** Process received signal ***
>>>>> Signal: Aborted (6)
>>>>> Signal code: (-6)
>>>>> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
>>>>> [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
>>>>> [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
>>>>> [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
>>>>> [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
>>>>> [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
>>>>> [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
>>>>> [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
>>>>> [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
>>>>> [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
>>>>>
>>>>> By reconfiguring, you mean recompiling petsc with that option, correct?
>>>>>
>>>>
>>>> Reconfiguring.
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Thank you.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knep...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.lin...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, there
>>>>>>>
>>>>>>> We have written a program using Petsc to solve large sparse matrix
>>>>>>> system. It has been working fine for a while. Recently we encountered a
>>>>>>> problem when the size of the sparse matrix is larger than 10TB. We used
>>>>>>> several hundred nodes and 2200 processes. The program always crashes
>>>>>>> during MatAssemblyBegin. Upon a closer look, there seems to be something
>>>>>>> unusual. We have a little memory check during loading the matrix to keep
>>>>>>> track of rss. The printout of rss in the log shows normal increase up to
>>>>>>> rank 2160, i.e., if we load in a portion of matrix that is 1GB, after
>>>>>>> MatSetValues for that portion, rss will increase roughly about that
>>>>>>> number. From rank 2161 onwards, the rss in every rank doesn't increase
>>>>>>> after matrix loaded. Then comes MatAssemblyBegin, the program crashed on
>>>>>>> rank 2160.
>>>>>>>
>>>>>>> Is there an upper limit on the number of processes Petsc can handle?
>>>>>>> or is there an upper limit in terms of the size of the matrix petsc can
>>>>>>> handle? Thank you very much for any info.
>>>>>>>
>>>>>>
>>>>>> It sounds like you overflowed int somewhere. We try and check for
>>>>>> this, but catching every place is hard. Try reconfiguring with
>>>>>>
>>>>>> --with-64-bit-indices
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>> <http://www.cse.buffalo.edu/~knepley/>
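For reference, the kind of overflow Matt suspects can be illustrated with the 2.4 billion column count mentioned in the July message. This is a minimal sketch in plain Python, not PETSc code. It assumes signed 32-bit indices, as in a build configured without --with-64-bit-indices; `fits_in_int32` and `wrap_int32` are hypothetical helpers, and the thread does not establish which quantity actually overflowed in the crashing run.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(n):
    """True if a non-negative count n is representable as a signed 32-bit int."""
    return 0 <= n <= INT32_MAX


def wrap_int32(n):
    """Value seen if n is truncated to a signed 32-bit int (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


ncols = 2_400_000_000  # column count quoted in the thread

print("INT32_MAX       :", INT32_MAX)
print("fits in 32 bits :", fits_in_int32(ncols))
print("wrapped value   :", wrap_int32(ncols))
```

The column count exceeds what a signed 32-bit integer can represent, so any index or size computed in 32 bits would wrap to a negative number. That is consistent with the advice to reconfigure with --with-64-bit-indices.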