Diego and Honza,

Diego asked on IRC whether we were planning to be able to serialize out all of the combined declarations. My response: we could certainly use the machinery that we currently have to do this.
However, I believe that Diego's motivation is somewhat flawed, and for that matter so are the plans in the WHOPR document. I think it is clear that for LTO to be successful, we are going to have to distribute the computation among either multiple processors in a single box or multiple machines scattered throughout a network. However, there are a lot of ways to skin this cat, and the WHOPR way appears quite complex. I know that the Google people are not scared of writing a lot of code, but in truth we can get what is likely to be a good solution in less than 1000 lines of code, and I believe that is what we should shoot for first.

Honza and I are a few days away from being able to read in the cgraph from each of the .o files, analyze the cgraph, and then load the functions from those same .o files, AS NECESSARY, for computation.

It seems that only two small pieces of code need to be written to make a distributed LTO:

1) A driver that knows what computational resources are available. Say you have N processors. All this driver has to do is spawn N copies of the command line on the N different processors. Each command line differs in only one option, the index of that processor within the group of resources: i.e. "-flto-index n" for n = 0 ... N-1. On an MP this is done with threads; on a network it is done with ssh commands. The driver will also need to make sure the .o file produced has the n in it somehow, so that things can be put back together later by the linker. (A toy sketch of such a driver is at the end of this mail.)

2) The cgraph analyzer, which is the part that figures out what to compile next, needs to do two additional tasks:

a) Sort the functions by estimated size (after inlining). This will be the order in which the functions are compiled. This sort needs to be stable and deterministic (both easy constraints).

b) Each of the spawned compilers actually compiles function X only if X's position in the sort, x, satisfies x mod N == the value of -flto-index. (A sketch of this partitioning rule also follows at the end of this mail.)

The sort is important because it allows for load balancing with very little effort.

What this scheme means is that all of the type merging and IPA pass analysis will be replicated on all of the processors in the group. However, that will not increase the wall time, only the cpu time (what's time to a pig?). It may, and most likely will, decrease the wall time, which is the resource we are actually most interested in.

I really do believe that this can be made to work very quickly, and unless/until it is shown not to be good enough, it is going to be the fastest way to get the system up and working on more than one processor. Writing out enough state to get things restarted on some other machine is going to be a lot of work.

Of course, all of this depends on our compiler being deterministic. The hashing police may have to be given special emergency powers to enforce this, but the truth is we are pretty good about it now.

Comments?

Kenny
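
P.S. For concreteness, here is a toy sketch of the driver from 1), written as a single C program that spawns local processes (the thread and ssh cases are analogous). The -flto-index option is the one proposed above; the gcc command line, the input.o name, and the part-n.o naming scheme are invented purely for illustration:

/* Toy driver: spawn N copies of the same LTO compile command,
   differing only in the proposed -flto-index option.  The command
   line below is made up for illustration only.  */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  int n_procs = argc > 1 ? atoi (argv[1]) : 4;
  int i;

  for (i = 0; i < n_procs; i++)
    {
      pid_t pid = fork ();
      if (pid == 0)
        {
          char index[16], output[32];
          snprintf (index, sizeof index, "%d", i);
          /* Tag the output with the index so the linker can put
             the pieces back together later.  */
          snprintf (output, sizeof output, "part-%d.o", i);
          execlp ("gcc", "gcc", "-flto-index", index,
                  "-o", output, "input.o", (char *) NULL);
          perror ("execlp");
          _exit (1);
        }
    }

  /* Wait for all N compiles to finish.  */
  while (wait (NULL) > 0)
    ;
  return 0;
}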
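And a toy sketch of the partitioning rule from 2a) and 2b). The struct, the function names, and the size numbers are all invented; I am also assuming the sort goes biggest-first, so that the expensive functions get spread across the processors early, and since qsort is not stable the cgraph order serves as the deterministic tie-break:

/* Toy partitioning: sort functions by estimated post-inlining size,
   then compile function x only when x mod N == -flto-index.  */
#include <stdio.h>
#include <stdlib.h>

struct fn
{
  const char *name;
  int order;           /* position in the cgraph, for tie-breaking */
  int estimated_size;  /* estimated size after inlining (invented) */
};

static int
compare_fns (const void *a, const void *b)
{
  const struct fn *fa = a;
  const struct fn *fb = b;
  /* Biggest first; qsort is not stable, so break ties
     deterministically on cgraph order.  */
  if (fa->estimated_size != fb->estimated_size)
    return fb->estimated_size - fa->estimated_size;
  return fa->order - fb->order;
}

int
main (void)
{
  struct fn fns[] = {
    { "main",   0, 120 },
    { "helper", 1,  40 },
    { "init",   2,  40 },
    { "leaf",   3,  10 },
  };
  int n_fns = sizeof fns / sizeof fns[0];
  int n_procs = 2;   /* N, the number of processors in the group */
  int lto_index = 0; /* this copy's -flto-index value */
  int x;

  qsort (fns, n_fns, sizeof fns[0], compare_fns);

  for (x = 0; x < n_fns; x++)
    if (x % n_procs == lto_index)
      printf ("compiling %s (estimated size %d)\n",
              fns[x].name, fns[x].estimated_size);
  return 0;
}

Every copy of the compiler runs the same sort on the same cgraph, so as long as the sort is deterministic the N copies agree on the partition without exchanging any state.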