> Hi,
> 
> I have a backlog of random improvements to the WPA phase of LTO 
> compilations, all with the common goal of reducing peak memory usage.  I 
> was basically dumping all trees that the WPA phase read in, and then tried 
> to think about which trees can be merged with already existing ones very 
> early or alternatively which shouldn't have been in the global section 
> from the beginning.  I.e. whenever there were two trees left after reading 
> in all units, that IMO should have been the same, I either tried to merge 
> them, or find reasons why they in fact were not the same.

Thanks, this hard work was on my TODO list for a while.  I am very heappy you
beat me on this!
I will give a try your patch on Mozilla, I think the reuslts should be a lot
better than on GCC.

> There's much one can merge already during read-in, types are the easiest 
> thing, some obvious top-level trees are in the same league (integer and 
> string constants, tree_lists used to collect arguments, field_decls of the 
> just replaced types).  Later patches will extend this set.  This current 
> one will deal just with the mentioned entities.

For DECls, obviously the IPA passes will be in the way.
You mentioned the code that just reads things in and applies to same cgraph
nodes.  This will break in non-trivial situation.
I however think that all we need is a way to read in a DECL and or cgraph ID
from stream file and get the corresponding (merged) DECL or cgraph plus
information if the merged DECL is one the ID originally referred to. I guess
all the existing stream in code for cgraph & IPA passes can be quite easilly
restructured to ignore or merge the data on prevailed nodes during read.
In fact I planned this design originally. I also planned to organize the 
summaries
into per-function sections, instead of per-pass sections, so one can read in
only data of those that won in the merging battle.  Obviously the implementation
went other way early in LTO branch design.
> 
> One nice property of our reader is, that all trees are nicely collected in 
> a sequential array (the reader cache).  This implies that we don't have to 
> use walk_tree to get to all reachable trees, we merely have to iterate 
> through that array and are sure we'll catch everything (except some 
> builtin trees).  We also don't have to remember which trees we've already 
> visited, we'll naturally visit each one just once.
> 
> The current callbacks of walk_tree are used for two purposes: merging 
> types and setting prevailing decls.  The latter can only be done after the 
> prevailing decls are known, and that is only when we've seen all object 
> files (even in the case of the linker plugin).  Hence, while uniquifying 
> the read trees I'm also remembering those that (potentially) refer to 
> variables or functions in a hashtable, which conveniently has support in 
> the garbage collector to remove unreachable trees (so, we'll later only 
> iterate over those trees that are known to be reachable).  I'm using that 
> hash at the place where currently walk_tree is used to iterate over only 
> those trees that potentially refer to a var/function_decl.
> 
> The type merger is in essence a moved and slightly changed copy of the 
> original walk_tree callback routines.  Via some asserts I've verified that 
> I'm replacing all reachable vars or functions with their prevailing decl, 
> before removing the old callbacks (the same cannot be done with the merged 
> types, due to ordering issues type merging depends on other types already 
> being merged, or, worse, some decls being merged; that's already currently 
> the case, rerunning the type merger will produce more merged types).

Yeah, this also sounds all good. I wondered why the code is not organized this 
way
and instead keeps walking everything recursively.
> 
> Merging declarations
>  {GC 320745k -> 204020k}Reading summaries
>  ipa lto gimple out    :   6.08 ( 7%) usr   0.14 ( 9%) sys   6.23 ( 7%) wall  
>  0 kB ( 0%) ggc
>  ipa lto decl in       :  21.24 (23%) usr   0.38 (24%) sys  21.65 (23%) wall  
> 311556 kB (65%) ggc
>  ipa lto decl out      :  17.89 (19%) usr   0.01 ( 1%) sys  17.91 (19%) wall  
>  0 kB ( 0%) ggc
>  ipa lto decl merge    :  12.09 (13%) usr   0.13 ( 8%) sys  12.22 (13%) wall  
>  7704 kB ( 2%) ggc
>  inline heuristics     :  21.05 (23%) usr   0.00 ( 0%) sys  21.07 (22%) wall  
>  28972 kB ( 6%) ggc
>  TOTAL                 :  92.99             1.58            94.68             
> 479817 kB
> 
> I.e. right after reading in all units we have 320MB of ggc memory.
> 
> With the patch:
> 
> Merging declarations
>  {GC 193371k -> 176876k}Reading summaries
>  ipa lto gimple out    :   6.63 ( 8%) usr   0.19 ( 9%) sys  11.13 (12%) wall  
>      0 kB ( 0%) ggc
>  ipa lto decl in       :  27.09 (32%) usr   0.30 (14%) sys  27.79 (30%) wall  
> 318044 kB (66%) ggc
>  ipa lto decl out      :  13.96 (17%) usr   0.02 ( 1%) sys  13.99 (15%) wall  
>      0 kB ( 0%) ggc
>  ipa lto decl merge    :   0.32 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall  
>      5 kB ( 0%) ggc
>  inline heuristics     :  21.04 (25%) usr   0.03 ( 1%) sys  21.10 (23%) wall  
>  28972 kB ( 6%) ggc
>  TOTAL                 :  84.10             2.17            91.78             
> 478594 kB
> 
> So, about 193MB of ggc memory after reading in all units (and even after 
> merging using 28MB less).  As expected the decl-in phase is slower (27s vs 
> 21s), but that is more than offsetted by the mostly trivial decl-merge 
> phase (0s vs 12s).  Faster and better, good :)

Nice, the peak memory usage reduction is particularly important. 
It would be cool if we could just avoid building those redundant tree type 
nodes, or just
ggc_free them, to make gabrage collector guys happy ;)

Honza

Reply via email to