https://www.mcs.anl.gov/petsc/documentation/faq.html#efficient-assembly 
Perhaps we should provide more information at this FAQ to help track down such 
issues.
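
For example, the number of mallocs incurred during MatSetValues() can be
queried after assembly with MatGetInfo(); it should be zero when the
preallocation is exact. A minimal sketch (error handling abbreviated; A is
a matrix that has already been assembled):

    #include <petscmat.h>

    Mat            A;     /* assembled elsewhere */
    MatInfo        info;
    PetscErrorCode ierr;

    ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_SELF, "mallocs during MatSetValues(): %g\n",
                       info.mallocs);CHKERRQ(ierr);

Running with -info also reports the number of mallocs at assembly time.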


> On Jul 2, 2020, at 8:10 PM, Matthew Knepley <knep...@gmail.com> wrote:
> 
> On Thu, Jul 2, 2020 at 7:30 PM Karl Lin <karl.lin...@gmail.com> wrote:
> Hi, Matt
> 
> Thanks for the tip last time. We just encountered another issue with large 
> data sets; this time the behavior is the opposite of last time. The data is 
> 13.5TB and the matrix has 2.4 billion columns. Our program crashed during 
> matrix loading due to memory overflow on one node. As mentioned before, we 
> have a small memory check during matrix loading to keep track of RSS. The 
> RSS printout in the log shows a normal increase on most nodes, i.e., if we 
> load in a 1GB portion of the matrix, RSS increases by roughly 1GB after 
> MatSetValues for that portion. On the node that overflowed, RSS increased 
> by 2GB after only 1GB of the matrix was loaded through MatSetValues. We are 
> very puzzled by this. What could make the memory footprint twice as large 
> as needed? Thanks in advance for any insight.
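> 
> (For illustration, one way to take such an RSS reading is with PETSc's 
> PetscMemoryGetCurrentUsage(); this is a simplified sketch, the exact check 
> in our code differs in detail:)
> 
>   PetscLogDouble rss_before, rss_after;
>   ierr = PetscMemoryGetCurrentUsage(&rss_before);CHKERRQ(ierr);
>   /* ... MatSetValues() calls for this portion of the matrix ... */
>   ierr = PetscMemoryGetCurrentUsage(&rss_after);CHKERRQ(ierr);
>   ierr = PetscPrintf(PETSC_COMM_SELF, "RSS grew by %g MB\n",
>                      (rss_after - rss_before) / 1048576.0);CHKERRQ(ierr);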
> 
> The only way I can imagine this happening is that you have not preallocated 
> correctly, so that some values are causing additional mallocs.
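> 
> For an MPIAIJ matrix that means giving per-row nonzero counts for the 
> diagonal and off-diagonal blocks before any MatSetValues() calls. A minimal 
> sketch (the local size here is illustrative; the counts must come from your 
> actual sparsity pattern):
> 
>   #include <petscmat.h>
> 
>   Mat            A;
>   PetscInt       m = 1000;          /* local rows (illustrative) */
>   PetscInt       *d_nnz, *o_nnz;    /* per-row nonzero counts    */
>   PetscErrorCode ierr;
> 
>   ierr = PetscCalloc2(m, &d_nnz, m, &o_nnz);CHKERRQ(ierr);
>   /* fill d_nnz[i]/o_nnz[i] from the known sparsity pattern here */
>   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>   ierr = MatSetSizes(A, m, m, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
>   ierr = MatSetType(A, MATMPIAIJ);CHKERRQ(ierr);
>   ierr = MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz);CHKERRQ(ierr);
>   /* make any preallocation miss a hard error instead of a silent malloc */
>   ierr = MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE);CHKERRQ(ierr);
>   ierr = PetscFree2(d_nnz, o_nnz);CHKERRQ(ierr);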
> 
>   Thanks,
> 
>      Matt
>  
> Regards,
> 
> Karl 
> 
> On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knep...@gmail.com> wrote:
> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.lin...@gmail.com> wrote:
> Hi, Matthew
> 
> Thanks for the suggestion. I just did another run; here is a detailed stack 
> trace, which may provide more insight:
>  *** Process received signal ***
> Signal: Aborted (6)
> Signal code:  (-6)
> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
>  [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
>  [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
>  [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
>  [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
>  [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
>  [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
>  [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
>  [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
>  [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
> 
> By reconfiguring, you mean recompiling PETSc with that option, correct?
> 
> Reconfiguring.
> 
>   Thanks,
> 
>     Matt
>  
> Thank you.
> 
> Karl
> 
> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knep...@gmail.com> wrote:
> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.lin...@gmail.com> wrote:
> Hi, there
> 
> We have written a program that uses PETSc to solve large sparse matrix 
> systems. It has been working fine for a while. Recently we encountered a 
> problem when the size of the sparse matrix is larger than 10TB. We used 
> several hundred nodes and 2200 processes. The program always crashes during 
> MatAssemblyBegin. Upon a closer look, there seems to be something unusual. 
> We have a small memory check during matrix loading to keep track of RSS. 
> The RSS printout in the log shows a normal increase up to rank 2160, i.e., 
> if we load in a 1GB portion of the matrix, RSS increases by roughly that 
> amount after MatSetValues for that portion. From rank 2161 onward, the RSS 
> on each rank does not increase after its portion of the matrix is loaded. 
> Then, at MatAssemblyBegin, the program crashes on rank 2160.
> 
> Is there an upper limit on the number of processes PETSc can handle, or an 
> upper limit on the size of the matrix PETSc can handle? Thank you very much 
> for any info.
> 
> It sounds like you overflowed int somewhere: with 32-bit PetscInt the 
> largest representable index or count is 2^31 - 1, about 2.1 billion, and a 
> matrix of over 10TB has far more total nonzeros than that. We try to check 
> for this, but catching every place is hard. Try reconfiguring with
> 
>   --with-64-bit-indices
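> 
> which makes PetscInt 64-bit throughout (rerun configure with the flag 
> added to your existing options, then rebuild). A quick sanity check that 
> the wider indices are in effect after the rebuild (a sketch; prints 8 for 
> a 64-bit-index build):
> 
>   #include <petscsys.h>
> 
>   #if defined(PETSC_USE_64BIT_INDICES)
>   /* with --with-64-bit-indices, indices and counts go up to ~9.2e18 */
>   #endif
>   ierr = PetscPrintf(PETSC_COMM_WORLD, "sizeof(PetscInt) = %d\n",
>                      (int)sizeof(PetscInt));CHKERRQ(ierr);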
> 
>   Thanks,
> 
>      Matt
>  
> Regards,
> 
> Karl   
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
