On Fri, Jul 3, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com> wrote:
> Hi, Matthew
>
> Thanks for the reply. However, if the matrix is huge, like 13.5TB in our
> case, it will take significant amount of time to loop over insertion twice.
> Any other time and resource saving options? Thank you very much.
>

Do you think you could do it once and time it? I would be surprised if it
takes even 1% of your total runtime, and I would also like to see the timing
in that we might be able to optimize something for you.

  Thanks,

     Matt

> Regards,
>
> Karl
>
> On Fri, Jul 3, 2020 at 10:57 AM Matthew Knepley <knep...@gmail.com> wrote:
>
>> On Fri, Jul 3, 2020 at 11:38 AM Karl Lin <karl.lin...@gmail.com> wrote:
>>
>>> Hi, Barry
>>>
>>> Thanks for the explanation. Following your tip, I have a guess. We use
>>> MatCreateAIJ to create the matrix, I believe this call will preallocate as
>>> well. Before this call we figure out the number of nonzeros per row for all
>>> rows and put those number in an array, say numNonZero. We pass numNonZero
>>> as d_nnz and o_nnz to MatCreateAIJ call, so essentially we preallocate
>>> twice as much as needed. For the process that double the memory footprint
>>> and crashed, there are a lot of values in both the diagonal and
>>> off-diagonal part for the process, so the temporary space gets filled up
>>> for both diagonal and off-diagonal parts of the matrix, also there are
>>> unused temporary space until MatAssembly, so gradually fill up the
>>> preallocated space which doubles the memory footprint. Once MatAssembly is
>>> done, the unused temporary space gets squeezed out, we return the correct
>>> memory footprint of the matrix. But before MatAssembly, large amount of
>>> unused temporary space needs to be kept because of the diagonal and
>>> off-diagonal pattern of the input. Would you say this is a plausible
>>> explanation? thank you.
>>>
>>
>> Yes. We find that it takes a very small amount of time to just loop over
>> the insertion twice, the first time counting the nonzeros.
>> We built something to do this for you:
>>
>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatPreallocatorPreallocate.html
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Regards,
>>>
>>> Karl
>>>
>>> On Fri, Jul 3, 2020 at 9:50 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>>
>>>> Karl,
>>>>
>>>> If a particular process is receiving values with MatSetValues()
>>>> that belong to a different process it needs to allocate temporary space for
>>>> those values. If there are many values destined for a different process
>>>> this space can be arbitrarily large. The values are not passed to the final
>>>> owning process until the MatAssemblyBegin/End calls.
>>>>
>>>> If you have not preallocated enough room the matrix actually makes
>>>> a complete copy of itself with extra space for additional values, copies
>>>> the values over and then deletes the old matrix; thus the memory use can
>>>> double when the preallocation is not correct.
>>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>> On Jul 3, 2020, at 9:44 AM, Karl Lin <karl.lin...@gmail.com> wrote:
>>>>
>>>> Yes, I did. The memory check for rss computes the memory footprint of
>>>> column index using size of unsigned long long instead of int.
>>>>
>>>> For Junchao, I wonder if keeping track of which loaded columns are
>>>> owned by the current process and which loaded columns are not owned also
>>>> needs some memory storage. Just a wild thought.
>>>>
>>>> On Thu, Jul 2, 2020 at 11:40 PM Ernesto Prudencio <epruden...@slb.com>
>>>> wrote:
>>>>
>>>>> Karl,
>>>>>
>>>>> * Are you taking into account that every “integer” index might be 64
>>>>> bits instead of 32 bits, depending on the PETSc configuration /
>>>>> compilation choices for PetscInt?
>>>>>
>>>>> Ernesto.
>>>>>
>>>>> From: petsc-users [mailto:petsc-users-boun...@mcs.anl.gov]
>>>>> On Behalf Of Junchao Zhang
>>>>> Sent: Thursday, July 2, 2020 11:21 PM
>>>>> To: Karl Lin <karl.lin...@gmail.com>
>>>>> Cc: PETSc users list <petsc-users@mcs.anl.gov>
>>>>> Subject: [Ext] Re: [petsc-users] matcreate and assembly issue
>>>>>
>>>>> Is it because indices for the nonzeros also need memory?
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>> On Thu, Jul 2, 2020 at 10:04 PM Karl Lin <karl.lin...@gmail.com> wrote:
>>>>>
>>>>> Hi, Matthew
>>>>>
>>>>> Thanks for the reply. However, I don't really get why additional
>>>>> malloc would double the memory footprint. If I know there is only 1GB
>>>>> matrix being loaded, there shouldn't be 2GB memory occupied even if Petsc
>>>>> needs to allocate more space.
>>>>>
>>>>> regards,
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Jul 2, 2020 at 8:10 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>>>
>>>>> On Thu, Jul 2, 2020 at 7:30 PM Karl Lin <karl.lin...@gmail.com> wrote:
>>>>>
>>>>> Hi, Matt
>>>>>
>>>>> Thanks for the tip last time. We just encountered another issue with
>>>>> large data sets. This time the behavior is the opposite from last time.
>>>>> The data is 13.5TB, the total number of matrix columns is 2.4 billion.
>>>>> Our program crashed during matrix loading due to memory overflow in one
>>>>> node. As said before, we have a little memory check during loading the
>>>>> matrix to keep track of rss. The printout of rss in the log shows normal
>>>>> increase in many nodes, i.e., if we load in a portion of the matrix that
>>>>> is 1GB, after MatSetValues for that portion, rss will increase roughly
>>>>> about 1GB. On the node that has memory overflow, the rss increased by
>>>>> 2GB after only 1GB of matrix is loaded through MatSetValues. We are very
>>>>> puzzled by this. What could make the memory footprint twice as much as
>>>>> needed? Thanks in advance for any insight.
>>>>>
>>>>> The only way I can imagine this happening is that you have not
>>>>> preallocated correctly, so that some values are causing additional
>>>>> mallocs.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>> Regards,
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>>>
>>>>> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com> wrote:
>>>>>
>>>>> Hi, Matthew
>>>>>
>>>>> Thanks for the suggestion, just did another run and here are some
>>>>> detailed stack traces, maybe will provide some more insight:
>>>>>
>>>>> *** Process received signal ***
>>>>> Signal: Aborted (6)
>>>>> Signal code: (-6)
>>>>> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
>>>>> [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
>>>>> [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
>>>>> [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
>>>>> [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
>>>>> [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
>>>>> [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
>>>>> [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
>>>>> [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
>>>>> [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
>>>>>
>>>>> By reconfiguring, you mean recompiling petsc with that option, correct?
>>>>>
>>>>> Reconfiguring.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knep...@gmail.com> wrote:
>>>>>
>>>>> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.lin...@gmail.com> wrote:
>>>>>
>>>>> Hi, there
>>>>>
>>>>> We have written a program using Petsc to solve large sparse matrix
>>>>> system. It has been working fine for a while. Recently we encountered a
>>>>> problem when the size of the sparse matrix is larger than 10TB. We used
>>>>> several hundred nodes and 2200 processes.
>>>>> The program always crashes during MatAssemblyBegin. Upon a closer look,
>>>>> there seems to be something unusual. We have a little memory check
>>>>> during loading the matrix to keep track of rss. The printout of rss in
>>>>> the log shows normal increase up to rank 2160, i.e., if we load in a
>>>>> portion of matrix that is 1GB, after MatSetValues for that portion, rss
>>>>> will increase roughly about that number. From rank 2161 onwards, the rss
>>>>> in every rank doesn't increase after matrix loaded. Then comes
>>>>> MatAssemblyBegin, the program crashed on rank 2160. Is there a upper
>>>>> limit on the number of processes Petsc can handle? or is there a upper
>>>>> limit in terms of the size of the matrix petsc can handle? Thank you
>>>>> very much for any info.
>>>>>
>>>>> It sounds like you overflowed int somewhere. We try and check for this,
>>>>> but catching every place is hard. Try reconfiguring with
>>>>> --with-64-bit-indices
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>> Regards,
>>>>>
>>>>> Karl
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>
>>>>> Schlumberger-Private *
>>>>>
>>>>
>>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
<http://www.cse.buffalo.edu/~knepley/>
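
For reference, the counting pass discussed in this thread (loop over the insertion once just to count nonzeros) could be sketched roughly as below. This is plain C, not PETSc code: count_nnz_split is a hypothetical helper, the CSR-style input (row_ptr, col_idx) and the use of int64_t in place of PetscInt are assumptions made only for illustration, and cstart/cend stand for the range of columns owned by the local process. The point it illustrates is that d_nnz and o_nnz are two different per-row counts, so passing one combined array for both over-allocates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PETSc routine.
 * First pass over the locally owned rows (assumed here to be in CSR form):
 * for each row, count the nonzeros whose global column index falls in the
 * locally owned column range [cstart, cend) (diagonal block, d_nnz) and
 * those that fall outside it (off-diagonal block, o_nnz). */
void count_nnz_split(size_t nrows,
                     const size_t *row_ptr,   /* nrows + 1 offsets */
                     const int64_t *col_idx,  /* global column indices */
                     int64_t cstart, int64_t cend,
                     int64_t *d_nnz, int64_t *o_nnz)
{
    assert(cstart <= cend);
    for (size_t i = 0; i < nrows; ++i) {
        d_nnz[i] = 0;
        o_nnz[i] = 0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col_idx[k] >= cstart && col_idx[k] < cend)
                ++d_nnz[i];   /* column owned locally: diagonal block */
            else
                ++o_nnz[i];   /* column owned elsewhere: off-diagonal block */
        }
    }
}
```

The two resulting arrays would then take the place of the single numNonZero array in the MatCreateAIJ call; the second pass does the actual MatSetValues insertion.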