Diego and Honza,

Diego asked on IRC whether we were planning to be able to serialize out all of the combined declarations. My response: we could certainly use the machinery that we currently have to do this.
However, I believe that Diego's motivation is somewhat flawed, and for that matter so are the plans in the WHOPR document. I think it is clear that for LTO to be successful, we are going to have to distribute the computation among either multiple processors in a single box or multiple machines scattered throughout a network. However, there are a lot of ways to skin this cat, and the WHOPR way appears quite complex. I know that the Google people are not scared of writing a lot of code, but in truth we can get what is likely to be a good solution in less than 1000 lines of code, and I believe that is what we should shoot for first.

Honza and I are a few days away from being able to read in the cgraph from each of the .o files, analyze the cgraph, and then load the functions from those same .o files, AS NECESSARY, for computation.

It seems that only two small pieces of code need to be written to make a distributed LTO:

1) A driver that knows what computational resources are available. Say you have N processors. All this driver has to do is spawn N copies of the command line on the N different processors. Each command line differs in only one option, the index of that processor within the group of resources: i.e. "-flto-index n" for n = 0 ... N-1. On an MP this is done with threads; on a network it is done with ssh commands. The driver will also need to make sure the .o file produced has the n in it somehow, so that things can be put back together later by the linker. (A toy sketch of such a driver is at the end of this mail.)

2) The cgraph analyzer, which is the part that figures out what to compile next, needs to do two additional tasks:

a) Sort the functions by estimated size (after inlining). This will be the order in which the functions are compiled. This sort needs to be stable and deterministic (both easy constraints).

b) Each of the spawned compilers actually compiles function X only if X's position in the sort, x, satisfies x mod N == the value of -flto-index. (A sketch of this partitioning rule also follows at the end of this mail.)

The sort is important because it allows for load balancing with very little effort.

What this scheme means is that all of the type merging and IPA pass analysis will be replicated on all of the processors in the group. However, that will not increase the wall time, only the cpu time (what's time to a pig?). It may, and most likely will, decrease the wall time, which is the resource we are actually most interested in.

I really do believe that this can be made to work very quickly, and unless/until it is shown not to be good enough, it is going to be the fastest way to get the system up and working on more than one processor. Writing out enough state to get things restarted on some other machine is going to be a lot of work.

Of course, all of this depends on our compiler being deterministic. The hashing police may have to be given special emergency powers to enforce this, but the truth is we are pretty good about it now.

Comments?

Kenny
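
P.S. For concreteness, here is a toy sketch of the driver from 1), written as a single C program that spawns local processes (the thread and ssh cases are analogous). The -flto-index option is the one proposed above; the gcc command line, the input.o name, and the part-n.o naming scheme are invented purely for illustration:

/* Toy driver: spawn N copies of the same LTO compile command,
   differing only in the proposed -flto-index option.  The command
   line below is made up for illustration only.  */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  int n_procs = argc > 1 ? atoi (argv[1]) : 4;
  int i;

  for (i = 0; i < n_procs; i++)
    {
      pid_t pid = fork ();
      if (pid == 0)
        {
          char index[16], output[32];
          snprintf (index, sizeof index, "%d", i);
          /* Tag the output with the index so the linker can put
             the pieces back together later.  */
          snprintf (output, sizeof output, "part-%d.o", i);
          execlp ("gcc", "gcc", "-flto-index", index,
                  "-o", output, "input.o", (char *) NULL);
          perror ("execlp");
          _exit (1);
        }
    }

  /* Wait for all N compiles to finish.  */
  while (wait (NULL) > 0)
    ;
  return 0;
}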
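And a toy sketch of the partitioning rule from 2a) and 2b). The struct, the function names, and the size numbers are all invented; I am also assuming the sort goes biggest-first, so that the expensive functions get spread across the processors early, and since qsort is not stable the cgraph order serves as the deterministic tie-break:

/* Toy partitioning: sort functions by estimated post-inlining size,
   then compile function x only when x mod N == -flto-index.  */
#include <stdio.h>
#include <stdlib.h>

struct fn
{
  const char *name;
  int order;           /* position in the cgraph, for tie-breaking */
  int estimated_size;  /* estimated size after inlining (invented) */
};

static int
compare_fns (const void *a, const void *b)
{
  const struct fn *fa = a;
  const struct fn *fb = b;
  /* Biggest first; qsort is not stable, so break ties
     deterministically on cgraph order.  */
  if (fa->estimated_size != fb->estimated_size)
    return fb->estimated_size - fa->estimated_size;
  return fa->order - fb->order;
}

int
main (void)
{
  struct fn fns[] = {
    { "main",   0, 120 },
    { "helper", 1,  40 },
    { "init",   2,  40 },
    { "leaf",   3,  10 },
  };
  int n_fns = sizeof fns / sizeof fns[0];
  int n_procs = 2;   /* N, the number of processors in the group */
  int lto_index = 0; /* this copy's -flto-index value */
  int x;

  qsort (fns, n_fns, sizeof fns[0], compare_fns);

  for (x = 0; x < n_fns; x++)
    if (x % n_procs == lto_index)
      printf ("compiling %s (estimated size %d)\n",
              fns[x].name, fns[x].estimated_size);
  return 0;
}

Every copy of the compiler runs the same sort on the same cgraph, so as long as the sort is deterministic the N copies agree on the partition without exchanging any state.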