> I have to admit that I had not considered the issue of having the later
> stages of compilation of one function feed into the early stages of
> another.  The problem with this is that it implies a serial ordering of
> the compilation, and that is not going to yield a usable system for
> compiling very large programs.
Yes, this is why I wanted to bring this up. I am not an expert on whole program register allocation, but my understanding is that it can really help, especially on machines like x86-64 where register saves in prologues/epilogues are relatively expensive. It seems to me that it can't scale to really big programs, but it doesn't have to if we subdivide the program sanely (i.e. do the equivalent of region based compilation for large functions). As long as the calls crossing region boundaries are not very frequent, we should all be happy.

Vladimir, Andrew, I am sure you know a lot more about global RA than I do. Is it reasonable to expect that RA only needs to propagate information down the call graph in topological order, and can we split the program into reasonably sized chunks where RA can be handled locally?

RA is IMO a similar case to some other less standard optimizations that hardly fit into the planned framework, and those don't necessarily deal with RTL level information. Some of the struct-reorg transforms affecting both data structures and algorithms, for instance, might be a problem. In an "analyze first, do all modification later" framework, their changes to function bodies will tend to affect the analysis of all other passes, so we would end up needing to teach every pass about the other passes' summaries and to wire in a lot of knowledge about pass ordering. It might become difficult to maintain 50 such passes in the future, each doing complex things that affect the analysis of the others before any modification is applied.

It seems clear that for programs in C++ and similar languages we will need more advanced IPA tricks than what is commonly done and understood today. Except for RA, it seems one can always argue case by case how an individual transform can be implemented in combination with other passes, but overall, having a place in our plans for a "big IPA" for the true whole program things, scaling to thousands and millions of functions, and a "small IPA" for the hard things on reasonably sized regions of the program (that is, what fits in memory and what we want to compile on one node) might make our life easier.

I can quite imagine the "big IPA" doing things like profile reading/estimation, constant propagation, devirtualization, cloning, (partial) inlining, alias analysis, identification of identical function bodies, and splitting the program into regions for local compilation. That is, say, about 10-20 passes, most of which are analyses or based mostly on copying code (cloning/inlining), which is easy to fit into the framework. The "small IPA" can do messy things like changing data structures that do not escape the region, and perhaps more involved versions of passes already done: after doing all the inlining and using the IPA data, programs tend to simplify a lot (several times over for C++, as can easily be seen from the data on early versus late inlining). Re-doing more exact alias analysis or inlining or whatever might be interesting then. During final compilation it can do IPA RA and related propagation of the low level information we only learn at the RTL level. The big IPA/small IPA split is, however, still compatible with what you propose; we just do the same splitting at each node and handle the resulting slices.
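To make the propagation question above a bit more concrete, here is a toy sketch of the kind of summary propagation I have in mind: callee-clobbered register sets flowing over an acyclic call graph, callees before callers, so a caller only has to save the registers a call can actually touch. The call graph, register masks and function names are invented for illustration; this is not GCC code.

/* Toy sketch: propagate callee-clobbered register sets over an acyclic
   call graph, callees before callers, so a caller only needs to save
   the registers a call can actually touch.  Everything here is
   invented for illustration; this is not GCC code.  */

#include <stdio.h>

#define NFUNCS 4

/* A handful of registers, represented as bits of a mask.  */
enum { RAX = 1 << 0, RBX = 1 << 1, R12 = 1 << 2, R13 = 1 << 3 };

static const char *fname[NFUNCS] = { "main", "parse", "hash", "report" };

/* Registers each function clobbers in its own body, ignoring calls.  */
static const unsigned local_clobber[NFUNCS] = {
  /* main   */ RAX,
  /* parse  */ RAX | RBX,
  /* hash   */ RAX,
  /* report */ R12
};

/* calls[i][j] != 0 means function i calls function j.  Assumed acyclic
   here; recursion would need a conservative fallback.  */
static const int calls[NFUNCS][NFUNCS] = {
  { 0, 1, 0, 1 },   /* main calls parse and report */
  { 0, 0, 1, 0 },   /* parse calls hash */
  { 0, 0, 0, 0 },   /* hash is a leaf */
  { 0, 0, 1, 0 }    /* report calls hash */
};

static unsigned summary[NFUNCS];  /* transitive clobber sets */
static int done[NFUNCS];

/* Post-order walk: visit callees first, then union their summaries
   into the caller.  */
static void
propagate (int f)
{
  int callee;

  if (done[f])
    return;
  done[f] = 1;
  summary[f] = local_clobber[f];
  for (callee = 0; callee < NFUNCS; callee++)
    if (calls[f][callee])
      {
        propagate (callee);
        summary[f] |= summary[callee];
      }
}

static void
print_mask (unsigned m)
{
  static const char *reg[] = { "rax", "rbx", "r12", "r13" };
  int i;

  for (i = 0; i < 4; i++)
    if (m & (1u << i))
      printf (" %s", reg[i]);
}

int
main (void)
{
  int f;

  for (f = 0; f < NFUNCS; f++)
    propagate (f);

  /* Around a call to F, a caller now only needs to save the live
     registers that are also in summary[F]; everything else is known
     to survive the call.  */
  for (f = 0; f < NFUNCS; f++)
    {
      printf ("%-6s clobbers:", fname[f]);
      print_mask (summary[f]);
      printf ("\n");
    }
  return 0;
}

Recursion and indirect calls would of course need a conservative fallback (assume the full call-clobbered set), and in a distributed build the per-region summaries would have to be shipped between nodes, but the propagation itself is a single walk over the call graph.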
> One of the things to consider on this issue is how many of these
> feedbacks can actually be preanalyzed.  Before I wrote the pure-const
> ipa pass, we discovered that a function was pure or const in some late
> rtl pass, and only later functions (in the compilation order) were able
> to take advantage of that fact.  By determining this as an ipa pass, we
> have completely removed that ordering dependency.  I admit that other
> dependencies will be harder, if not impossible, to accommodate, but we
> may find that many are.

There are many things you can do in the early IPA as we do it now. The late bits IMO reduce to very low level stuff, e.g.:

1) For RA it is good to know what the called function really clobbers, so the caller saves only the registers that are actually needed (that is a very simple form of global RA, along the lines of the sketch above). I would say it is quite clear we can't reasonably estimate this at the GIMPLE level before inlining, which is what we see in early IPA.

2) On i386 and friends we propagate down the amount of stack alignment needed. At the GIMPLE level IPA we could probably look for leaf functions and for the largest data they operate on, but that is quite inexact. See the stack_alignment_needed and preferred_stack_boundary code.

3) It was considered several times that for PIC i386 libraries (and all targets where loading the GOT pointer is expensive), we could create alternative entry points for functions that expect the GOT pointer to be already initialized. All calls within the current unit can go to the local entry point if the caller has already initialized the GOT. Of course, one of the entry points can be optimized out if it is not used at all. I used to have a patch for this; except for the elimination of unused entry points it doesn't need much IPA propagation, just downward propagation.

4) Some of our analyses, like the decision whether a function can throw, are estimated at the IPA level and later refined if optimization helps. The same would apply to pure functions too, but it is not done.

5) I think more examples can be found.

I am not saying that all of our code quality depends on the above tricks, but we should understand what we are giving up in our design and see whether something really matters. I think that RA is the only one that truly does, but I might easily have missed something.

> I also have to admit that I did not consider the placement of functions
> when I came up with my sorting strategy.  I admit that my first simple
> trick will likely lead to a bad ordering of functions, but so will
> producing a separate .o file for every function (unless we later have a
> smart linker that can reorder the functions to get a good answer).

I think we need an independent way to order functions in the resulting binary anyway. The optimal code layout tends to be the exact opposite of the ordering we benefit from most during compilation. GAS supports subsections and fragments, which helps here, and some other systems have reordering built into the linker. At the moment we do only very trivial function reordering: constructors, cold functions and very hot functions each get a separate subsection, so if the hot region of the program is small we are happy enough with this. Once whole program compilation is in and this becomes interesting, I think we should consider linker support for it, as is done elsewhere, unless we find a way to produce a single big .s file where we can use GAS's fragments.
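Just to make that "very trivial function reordering" concrete, here is a small sketch of what I mean by giving constructors, cold functions and very hot functions their own subsections. The subsection names, the threshold and the profile numbers are invented for illustration; this is not the code we actually have.

/* Toy sketch: bucket functions into named text subsections by profile
   count, so startup code, cold code and the hot region each land in
   their own part of .text.  Names, threshold and numbers are invented
   for illustration; this is not GCC code.  */

#include <stdio.h>

struct func_info
{
  const char *name;
  long long count;      /* profile execution count */
  int is_constructor;   /* static constructor?  */
};

/* Hypothetical threshold separating "very hot" from ordinary code.  */
#define HOT_THRESHOLD 100000

static const char *
pick_subsection (const struct func_info *f)
{
  if (f->is_constructor)
    return ".text.startup";   /* run once at startup, keep together */
  if (f->count == 0)
    return ".text.unlikely";  /* never executed in the training run */
  if (f->count >= HOT_THRESHOLD)
    return ".text.hot";       /* keep the hot region small and dense */
  return ".text";
}

int
main (void)
{
  static const struct func_info funcs[] = {
    { "init_tables",  1,      1 },
    { "parse_line",   500000, 0 },
    { "report_error", 0,      0 },
    { "main_loop",    250000, 0 },
  };
  unsigned i;

  for (i = 0; i < sizeof funcs / sizeof funcs[0]; i++)
    printf ("%-12s -> %s\n", funcs[i].name, pick_subsection (&funcs[i]));
  return 0;
}

Whether the final layout then comes from subsections like this, from GAS fragments, or from a reordering linker is exactly the open question.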
> However, my meta point still stands.  There is some global deterministic
> sorting that the cgraph analyzer can do such that each processor, with
> no additional coordination, can deterministically pick which functions
> to compile in its own space.  For instance, we can use the existing
> order and just have each process divide that order in some standard way
> by N and take the nth slice for itself.
>
> If you do this, you can then, at least locally within the cluster of
> functions allocated on a single processor, do many of the tricks that
> you describe above.

Yep, we seem to be in sync here ;) I am just not a big fan of the idea that optimization decisions will depend on the number of CPUs the user happened to compile the program with.

I will need to think a bit about the distribution issues. While you are right that most network filesystems will not fetch the whole file when you read just the initial summary, it seems to me that, as a result of inlining decisions and program splitting, we might end up reading a significant portion of the object files densely enough that we consume a lot of bandwidth. Also, I don't see how, by doing the splitting completely in isolation, based only on the number of nodes and a very inexact estimate of function sizes (which only loosely relates to compilation time), you can balance the work so that the nodes run for reasonably similar amounts of time, when that time depends on many factors including nonuniformity of the hardware.

However, I fully agree that we need to experiment here. I don't see the choice between those two models (i.e. fully centralized, shipping objects to nodes, or fully decentralized, with each node making all decisions on its own) significantly affecting any of the very near steps in the overall LTO/IPA plan outlined on the wiki. It would be good to keep in mind in the serialization code that it might be used to serialize just a portion of the program in memory; that is probably all that is needed to implement the centralized scheme, or something in between, if we find it beneficial.

Honza