Jakub,
I am adding you to CC since I put my current thoughts on LTO and debug info
in here.
> > Fork-fire-forget is really a much simpler choice here IMO; no worries 
> > about shared resources, less debug hassle.
> 
> It might be not as cheap as it is on Linux hosts on other hosts of
> course.  Also I'd rather try to avoid I/O than solving the issue

I still have some items on my list here:
 1) avoid having function sections decompressed by WPA
    (this won't bring much compile time improvement, as decompression is
     well below 10% of the runtime)
 2) put variable initializers into named sections just as function bodies
    are.
    Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
    initializers are actually about as big as the text segment.  While
    it seems a bit wasteful to put a single integer_cst there (and we can
    special case this), it looks promising for vtables and other stuff.

    To make devirt work, we will need to load vtables into memory (or
    invent a representation to stream them some other way, which would be
    similarly big).  Still, we will avoid the need to load them in 5000
    copies and merge them.
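
    A rough sketch of how the writer side could look; the two section helpers
    and stream_initializer_body below are made up, the real code would go
    through the existing LTO section output machinery:

      /* Stream DECL's initializer into its own named LTO section,
         analogous to function body sections, so WPA can read it lazily
         (e.g. only when devirtualization needs the vtable contents).  */
      static void
      output_variable_initializer (tree decl)
      {
        if (!DECL_INITIAL (decl) || DECL_INITIAL (decl) == error_mark_node)
          return;
        /* Special case trivial initializers - a separate section per
           integer_cst would be wasteful, so keep those where they are.  */
        if (TREE_CODE (DECL_INITIAL (decl)) == INTEGER_CST)
          return;
        begin_lto_section_for_decl (decl);                 /* assumed helper */
        stream_initializer_body (DECL_INITIAL (decl));     /* assumed helper */
        end_lto_section_for_decl (decl);                   /* assumed helper */
      }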
 3) I think a good part of the function/partitioning overhead is because
    abstract origin streaming is utterly broken.

    Currently we can have DECL_ABSTRACT_ORIGIN on a function.  This I can now
    track by the used_as_abstract_origin flag, and I can stream those functions
    into the partitions using them.

    This is still wrong for a multitude of reasons:

    1) we really want the DECL_INITIAL tree of functions used as abstract
       origins in the form it had before any gimple optimizations happened on
       them (that is, when the debug hook is called).
       This is not what happens - we stream the tree as it looks at
       LTO streaming time, i.e. after early optimizations.

       I think we may just (at the time the debug hook is called) duplicate
       DECL_INITIAL the same way we duplicate decls for save_function_body,
       save it elsewhere, and make this tree the abstract origin of the
       offline copy of the function itself.
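
       A minimal sketch of that, where copy_block_tree is a made-up helper
       (the real thing would presumably reuse the same remapping we use when
       duplicating decls for save_function_body):

         /* Run at the time the debug hook is called, i.e. before gimple
            optimizations touch the body.  */
         static void
         save_abstract_origin_copy (tree fndecl)
         {
           tree saved = copy_node (fndecl);
           /* Deep-copy the not-yet-optimized BLOCK tree.  */
           DECL_INITIAL (saved) = copy_block_tree (DECL_INITIAL (fndecl));
           /* The saved copy becomes the abstract origin of the offline
              copy of the function, so debug info always sees the
              pre-optimization tree.  */
           DECL_ABSTRACT_ORIGIN (fndecl) = saved;
           DECL_ABSTRACT_ORIGIN (saved) = NULL_TREE;
         }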

    2) dwarf2out doesn't really read the DECL_INITIAL tree, so it does
       something useful only when it is already there.
       It can simply call cgraph_get_body when it needs the DECL_INITIAL, but it
       doesn't because push_cfun causes an ICE.
       If we really can't push_cfun from the middle of the RTL queue, I suppose
       I can just save it elsewhere.
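
       In other words, for the function DECL whose abstract instance we emit,
       I would like dwarf2out to be able to do roughly the following (which
       today runs into the push_cfun ICE):

         /* Materialize the body from the LTO stream so DECL_INITIAL is
            available before emitting the abstract instance.  */
         struct cgraph_node *node = cgraph_get_node (decl);
         if (node && !DECL_INITIAL (decl))
           cgraph_get_body (node);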

    3) It is not only the toplevel decl that has an origin, but also all local
       vars in the function.

       I think this goes terribly wrong - these decls are not indexable, so they
       are stored into the function section of every function referring to them.
       They are then read in many duplicates and never merged with the
       DECL_INITIAL tree of the actual abstract origin.  For some reason
       dwarf2out doesn't seem to ICE, but I also do not see how this can
       produce working debug.  Moreover, I think the duplicates contribute to
       our current debug info size problems with LTO.

       If we solve 1) as discussed above (i.e. by having separate block trees
       for functions that are abstract origins), we can then solve 3) by
       streaming those into the global decl stream and making
       cross-function-context tree references global.
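
       On the writer side this might boil down to relaxing the indexability
       check in lto-streamer-out.c along these lines (the predicate on the
       second line is made up):

         /* Local decls are normally not indexable and get copied into every
            function section referring to them.  If their containing function
            is used as an abstract origin, force them into the global decl
            stream so there is exactly one copy to point at.  */
         if (auto_var_in_fn_p (t, DECL_CONTEXT (t))
             && !decl_context_used_as_abstract_origin_p (t))   /* assumed */
           return false;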

    4) Of course, after early inlining a function may need abstract origins
       from multiple other functions.  I do not track this at all.
       It may be easy to just collect a vector of the functions that are needed
       into the cgraph_node.
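
       Something as simple as an extra vector on the node may do; the field
       and hook names below are just placeholders:

         /* In cgraph_node: all functions whose pre-optimization block trees
            this function needs as abstract origins after early inlining
            copied parts of their bodies into it.  */
         vec<cgraph_node *> needed_abstract_origins;

         /* Hypothetical hook called by the inliner when blocks of ORIGIN
            are duplicated into NODE.  */
         static void
         note_abstract_origin_use (struct cgraph_node *node,
                                   struct cgraph_node *origin)
         {
           node->needed_abstract_origins.safe_push (origin);
         }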

    Of course, solving 1)-4) is a bit of early debug info without actually
    streaming the DWARF DIEs, but by using the BLOCK trees as a temporary
    representation.  Incrementally, we can turn this saved BLOCK tree into a
    DWARF DIE and have the origins point to it instead of to decls.

    To get reasonable streaming performance, it would be nice to have a way to
    make abstract origin references cross partitions, which is something debug
    info can accomplish.

That said, I now have the fork() patch in all my trees and enjoy 50% faster
WPA times.  I changed my mind about the claim that streaming should be disk
bound - it is hard to hope for disk boundedness for something that should fit
in cache.

We went down from 5GB to 2GB of streaming for Firefox, which is good.  But we
will see 4GB again once Martin's code layout work lands.  I think that is in
good part because of the origin fun above.

Honza

> by parallelizing it.  Of course we can always come back to this
> kind of hack later.
> 
> Richard.
