> I have to admit that I had not considered the issue of having the later
> stages of compilation of one function feed into the early stages of
> another.  The problem with this is that it implies a serial ordering of
> the compilation, and that is not going to yield a usable system for
> compiling very large programs.
Yes, this is why I wanted to bring this up. I am not an expert on whole program register allocation, but my understanding is that it can really help, especially on machines like x86-64 where register saves in prologues/epilogues are relatively expensive. It seems to me that it can't scale to really big programs, but it doesn't have to if we subdivide the program sanely (i.e. do the equivalent of region based compilation for large functions). As long as the calls crossing region boundaries are not very frequent, we should all be happy.

Vladimir, Andrew, I am sure you know a lot more about global RA than I do. Is it reasonable to expect that RA only needs to propagate information down the call graph in topological order, and can we split the program into reasonably sized chunks where RA can be handled locally?

RA is IMO a similar case to some other less standard optimizations that hardly fit into the planned framework, and those don't necessarily deal with RTL level information. Some of the struct-reorg transforms affecting both data structures and algorithms, for instance, might be a problem. In an "analyze first, do all modification later" framework, their changes to function bodies will tend to affect the analysis of all other passes, so we would end up needing to teach every pass about the other passes' summaries and to wire in a lot of knowledge about pass ordering. It might become difficult to maintain 50 such passes in the future, each doing complex things that affect the analysis of the others before any modification is applied.

It seems clear that for programs in C++ and similar languages we will need more advanced IPA tricks than what is commonly done and understood today. Except for RA, it seems one can always argue case by case how an individual transform can be implemented in combination with other passes, but overall, having a place in our plans for a "big IPA" for the true whole program things, scaling to thousands and millions of functions, and a "small IPA" for the hard things on reasonably sized regions of the program (that is, what fits in memory and what we want to compile on one node) might make our life easier.

I can quite imagine the "big IPA" doing things like profile reading/estimation, constant propagation, devirtualization, cloning, (partial) inlining, alias analysis, identification of identical function bodies, and splitting the program into regions for local compilation. That is, say, about 10-20 passes, most of which are analyses or based mostly on copying code (cloning/inlining), which is easy to fit into the framework. The "small IPA" can do messy things like changing data structures that do not escape the region, and perhaps more involved versions of passes already done: after doing all the inlining and using the IPA data, programs tend to simplify a lot (several times over for C++, as can easily be seen from the data on early versus late inlining). Re-doing more exact alias analysis or inlining or whatever might be interesting then. During final compilation it can do IPA RA and related propagation of the low level information we only learn at the RTL level. The big IPA/small IPA split is, however, still compatible with what you propose; we just do the same splitting at each node and handle the resulting slices.
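To make the propagation question above a bit more concrete, here is a toy sketch of the kind of summary propagation I have in mind: callee-clobbered register sets flowing over an acyclic call graph, callees before callers, so a caller only has to save the registers a call can actually touch. The call graph, register masks and function names are invented for illustration; this is not GCC code.

/* Toy sketch: propagate callee-clobbered register sets over an acyclic
   call graph, callees before callers, so a caller only needs to save
   the registers a call can actually touch.  Everything here is
   invented for illustration; this is not GCC code.  */

#include <stdio.h>

#define NFUNCS 4

/* A handful of registers, represented as bits of a mask.  */
enum { RAX = 1 << 0, RBX = 1 << 1, R12 = 1 << 2, R13 = 1 << 3 };

static const char *fname[NFUNCS] = { "main", "parse", "hash", "report" };

/* Registers each function clobbers in its own body, ignoring calls.  */
static const unsigned local_clobber[NFUNCS] = {
  /* main   */ RAX,
  /* parse  */ RAX | RBX,
  /* hash   */ RAX,
  /* report */ R12
};

/* calls[i][j] != 0 means function i calls function j.  Assumed acyclic
   here; recursion would need a conservative fallback.  */
static const int calls[NFUNCS][NFUNCS] = {
  { 0, 1, 0, 1 },   /* main calls parse and report */
  { 0, 0, 1, 0 },   /* parse calls hash */
  { 0, 0, 0, 0 },   /* hash is a leaf */
  { 0, 0, 1, 0 }    /* report calls hash */
};

static unsigned summary[NFUNCS];  /* transitive clobber sets */
static int done[NFUNCS];

/* Post-order walk: visit callees first, then union their summaries
   into the caller.  */
static void
propagate (int f)
{
  int callee;

  if (done[f])
    return;
  done[f] = 1;
  summary[f] = local_clobber[f];
  for (callee = 0; callee < NFUNCS; callee++)
    if (calls[f][callee])
      {
        propagate (callee);
        summary[f] |= summary[callee];
      }
}

static void
print_mask (unsigned m)
{
  static const char *reg[] = { "rax", "rbx", "r12", "r13" };
  int i;

  for (i = 0; i < 4; i++)
    if (m & (1u << i))
      printf (" %s", reg[i]);
}

int
main (void)
{
  int f;

  for (f = 0; f < NFUNCS; f++)
    propagate (f);

  /* Around a call to F, a caller now only needs to save the live
     registers that are also in summary[F]; everything else is known
     to survive the call.  */
  for (f = 0; f < NFUNCS; f++)
    {
      printf ("%-6s clobbers:", fname[f]);
      print_mask (summary[f]);
      printf ("\n");
    }
  return 0;
}

Recursion and indirect calls would of course need a conservative fallback (assume the full call-clobbered set), and in a distributed build the per-region summaries would have to be shipped between nodes, but the propagation itself is a single walk over the call graph.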
> One of the things to consider on this issue is how many of these
> feedbacks can actually be preanalyzed.  Before I wrote the pure-const
> ipa pass, we discovered that a function was pure or const in some late
> rtl pass, and only later functions (in the compilation order) were able
> to take advantage of that fact.  By determining this as an ipa pass, we
> have completely removed that ordering dependency.  I admit that other
> dependencies will be harder, if not impossible, to accommodate, but we
> may find that many are.

There are many things you can do in the early IPA as we do it now. The late bits IMO reduce to very low level stuff, e.g.:

1) For RA it is good to know what the called function really clobbers, so the caller saves only the registers that are actually needed (that is a very simple form of global RA, along the lines of the sketch above). I would say it is quite clear we can't reasonably estimate this at the GIMPLE level before inlining, which is what we see in early IPA.

2) On i386 and friends we propagate down the amount of stack alignment needed. At the GIMPLE level IPA we could probably look for leaf functions and for the largest data they operate on, but that is quite inexact. See the stack_alignment_needed and preferred_stack_boundary code.

3) It was considered several times that for PIC i386 libraries (and all targets where loading the GOT pointer is expensive), we could create alternative entry points for functions that expect the GOT pointer to be already initialized. All calls within the current unit can go to the local entry point if the caller has already initialized the GOT. Of course, one of the entry points can be optimized out if it is not used at all. I used to have a patch for this; except for the elimination of unused entry points it doesn't need much IPA propagation, just downward propagation.

4) Some of our analyses, like the decision whether a function can throw, are estimated at the IPA level and later refined if optimization helps. The same would apply to pure functions too, but it is not done.

5) I think more examples can be found.

I am not saying that all of our code quality depends on the above tricks, but we should understand what we are giving up in our design and see whether something really matters. I think that RA is the only one that truly does, but I might easily have missed something.

> I also have to admit that I did not consider the placement of functions
> when I came up with my sorting strategy.  I admit that my first simple
> trick will likely lead to a bad ordering of functions, but so will
> producing a separate .o file for every function (unless we later have a
> smart linker that can reorder the functions to get a good answer).

I think we need an independent way to order functions in the resulting binary anyway. The optimal code layout tends to be the exact opposite of the ordering we benefit from most during compilation. GAS supports subsections and fragments, which helps here, and some other systems have reordering built into the linker. At the moment we do only very trivial function reordering: constructors, cold functions and very hot functions each get a separate subsection, so if the hot region of the program is small we are happy enough with this. Once whole program compilation is in and this becomes interesting, I think we should consider linker support for it, as is done elsewhere, unless we find a way to produce a single big .s file where we can use GAS's fragments.
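Just to make that "very trivial function reordering" concrete, here is a small sketch of what I mean by giving constructors, cold functions and very hot functions their own subsections. The subsection names, the threshold and the profile numbers are invented for illustration; this is not the code we actually have.

/* Toy sketch: bucket functions into named text subsections by profile
   count, so startup code, cold code and the hot region each land in
   their own part of .text.  Names, threshold and numbers are invented
   for illustration; this is not GCC code.  */

#include <stdio.h>

struct func_info
{
  const char *name;
  long long count;      /* profile execution count */
  int is_constructor;   /* static constructor?  */
};

/* Hypothetical threshold separating "very hot" from ordinary code.  */
#define HOT_THRESHOLD 100000

static const char *
pick_subsection (const struct func_info *f)
{
  if (f->is_constructor)
    return ".text.startup";   /* run once at startup, keep together */
  if (f->count == 0)
    return ".text.unlikely";  /* never executed in the training run */
  if (f->count >= HOT_THRESHOLD)
    return ".text.hot";       /* keep the hot region small and dense */
  return ".text";
}

int
main (void)
{
  static const struct func_info funcs[] = {
    { "init_tables",  1,      1 },
    { "parse_line",   500000, 0 },
    { "report_error", 0,      0 },
    { "main_loop",    250000, 0 },
  };
  unsigned i;

  for (i = 0; i < sizeof funcs / sizeof funcs[0]; i++)
    printf ("%-12s -> %s\n", funcs[i].name, pick_subsection (&funcs[i]));
  return 0;
}

Whether the final layout then comes from subsections like this, from GAS fragments, or from a reordering linker is exactly the open question.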
> However, my meta point still stands.  There is some global deterministic
> sorting that the cgraph analyzer can do such that each processor, with
> no additional coordination, can deterministically pick which functions
> to compile in its own space.  For instance, we can use the existing
> order and just have each process divide that order in some standard way
> by N and take the nth slice for itself.
>
> If you do this, you can then, at least locally within the cluster of
> functions allocated on a single processor, do many of the tricks that
> you describe above.

Yep, we seem to be in sync here ;) I am just not a big fan of the idea that optimization decisions will depend on the number of CPUs the user happened to compile the program with.

I will need to think a bit about the distribution issues. While you are right that most network filesystems will not fetch the whole file when you read just the initial summary, it seems to me that, as a result of inlining decisions and program splitting, we might end up reading a significant portion of the object files densely enough that we consume a lot of bandwidth. Also, I don't see how, by doing the splitting completely in isolation, based only on the number of nodes and a very inexact estimate of function sizes (which only loosely relates to compilation time), you can balance the work so that the nodes run for reasonably similar amounts of time, when that time depends on many factors including nonuniformity of the hardware.

However, I fully agree that we need to experiment here. I don't see the choice between those two models (i.e. fully centralized, shipping objects to nodes, or fully decentralized, with each node making all decisions on its own) significantly affecting any of the very near steps in the overall LTO/IPA plan outlined on the wiki. It would be good to keep in mind in the serialization code that it might be used to serialize just a portion of the program in memory; that is probably all that is needed to implement the centralized scheme, or something in between, if we find it beneficial.

Honza