There are a few things that I'm considering working on for the 4.4
release, and I figured I'd see what others thought. Is anyone
considering/doing similar or related work?
I'll summarize each, and then go into more details.
1 - Pass cleanup. There have been rumblings about this, but I haven't
seen a lot of actual progress. We currently run hundreds of passes all the
time, and we don't really know how effective or necessary many of them
are. We also don't know whether the cleanups they run are truly doing
much. It would be nice to do some actual analysis and do something with
the results.
2 - Interface to the virtual operands. The virtual operand web provides
what really amounts to low level detail of memory accesses. Every pass
that cares about memory access ends up interpreting the data and dealing
with the VOP web modification itself. This is in some ways analogous to
every pass having to implement bsi_insert() itself. It would be nice if
there was one standard central interpreter of the data that knows how to
view and modify the web. Richi is working on an alias oracle of some
sort; I'm not sure of its details, so I don't know how far it delves into
this sort of thing, or what approach it takes.
3 - SRA. There appear to be some deficiencies in SRA, and also how it
interacts with the MEMSSA partitioner. I found when looking at some 4.3
bugs that it's never as simple as looking at the base and offset, at
least not consistently. Most of it appears to be
first-implementation-itis: the discovery of new issues as you go.
Stepping back and reviewing the functional requirements and its
interaction with the partitioner should be useful.
4 - SSA pressure reduction. I'm throwing this back on the table. I never
quite got around to it before, and nothing has changed to resolve the
issues. We freely create as much register pressure in the SSA optimizers
as we want (as we should be able to). The backend does nothing to
address the issue and the RTL register allocator is simply swamped by
the sheer quantity of live ranges sometimes. Perhaps Vlad's RA will get
in this release, and/or perhaps the need for this will be eliminated by
something else, but it is something that may help code generation in the
short term at least.
Those are the primary subjects, in no particular order. Now I'll delve a
bit deeper into the details of what's involved for each one.
If anyone is interested in any of these tasks, or has any
questions/observations/criticisms, please let me know. (I know you will!
:-)
I will also stick something up on the wiki in a bit with this
information, and whatever other details come out from
suggestions/discussions or further investigation.
Pass Cleanup
-----------------
It would sure be nice to streamline our pass pipeline. One could spend
the rest of one's life doing this by hand. The two biggest problems would
seem to be:
* we have no idea whether a pass actually does much
* and we have no idea whether it was actually useful
So I was thinking that maybe we could modify passes to report what they
did. When CHECKING_ENABLED is on (or something like that), every pass
reports what it did, possibly to the pass manager.
Initially I was thinking it might have something like:
* number of statements changed
* number of statements added
* number of statements deleted
* number of names added
* number of names deleted
And then a report is issued for the compilation listing every pass that
ran and a summary of this data for each occurrence. Each of the TODO
cleanups run by each pass should also have its data listed.
Then we could run the compiler over a whack of testcases, accumulate all
these reports, and generate some data on which passes and cleanups are
not doing very much and are candidates for closer inspection.
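As a rough sketch of what such per-pass reporting and accumulation could look like (all the names here, PassStats, summarize, and the counter fields, are hypothetical illustrations, not any existing pass manager API):

```python
from collections import defaultdict

class PassStats:
    """Counters a pass would report when checking is enabled.
    Hypothetical: not the actual GCC pass manager interface."""
    FIELDS = ("stmts_changed", "stmts_added", "stmts_deleted",
              "names_added", "names_deleted")

    def __init__(self):
        self.counts = {f: 0 for f in self.FIELDS}

    def record(self, field, n=1):
        self.counts[field] += n

    def did_anything(self):
        return any(self.counts.values())

def summarize(reports):
    """Accumulate (pass_name, PassStats) reports from many compilations
    and compute, per pass, the total counters and the fraction of runs
    in which the pass changed nothing at all."""
    totals = defaultdict(lambda: {f: 0 for f in PassStats.FIELDS})
    runs = defaultdict(int)
    idle = defaultdict(int)
    for pass_name, stats in reports:
        runs[pass_name] += 1
        if not stats.did_anything():
            idle[pass_name] += 1
        for f, v in stats.counts.items():
            totals[pass_name][f] += v
    return {"totals": dict(totals),
            "idle_fraction": {p: idle[p] / runs[p] for p in runs}}
```

A pass with a high idle fraction across a big testcase corpus would be the kind of candidate for closer inspection described above.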
One could then turn off one or more of these passes, run again, and see
whether there appeared to be much impact on the other passes.
A closer inspection may identify that perhaps the pass should be
somewhere else in the pipeline, run only at -O3, completely eliminated,
or modified in some way.
There are also categories of optimizations:
* optimization performers - The workhorses that actually do useful things.
* optimization enablers - Introduce situations which will enable a
later optimizer.
* cleanups - Those which remove crud or undo enabler work which was
not profitable.
And these should probably be treated differently. Enablers tend to work
in concert with optimizations and sometimes also cleanups. You need to
look at the data for them together to see if useful work was done. If
the cleanup is usually undoing everything the enabler did, then the
enabler isn't really enabling, it's just chewing cycles :-) (or there is
a flaw in the optimization; in any case, it all deserves a closer look
if the group isn't accomplishing much)
It is also possible that a pass should only be run on a specific
architecture or set of arches. I see no reason why we shouldn't allow
the pass pipeline to be tuned for specific architectures. Not
everything that is good for a 32 register machine is good for one with 8
registers and vice versa.
This could then be further extended into the RTL passes, and there are
some other extensions that could be useful. It would be nice for
instance if we could statically guess at whether the runtime was
affected or not, and by how much. Many modifications that optimizations
make aren't really going to be measurable by simply testing execution speed.
* The scheduler could perhaps spit out a summary of its estimated
cycle count through the predicted path, the key blocks, or all
blocks.
* The loop optimizers could submit summary info about loops, and tied
in with the scheduler info, we could guess at whether an optimization
affected the cycles in a loop by generating reports with and without the
pass and comparing the cycles estimate of the core loops and main path.
* We can also compare code size based on the object code produced, and
could work on the -Os pass pipeline as well.
* These reports would be useful for some of the automated pass
shuffling experiments as well, I would think.
And so on. Once properly set up, you could actually automate quite a bit
of this and maybe get some very interesting data.
I think it would also be a good idea to set up a generic logging
mechanism for this. There are other tasks within the compiler that a
generic logging mechanism would be useful for. I've seen requests for
optimizers to generate reports on what they did or didn't do and why,
providing hints to the programmer about how to change their code to get
better optimizations. I think we even had/have a branch for this sort
of thing. It seems like a good opportunity to get something generic in
place for future use.
Interface to Virtual Operands
-------------------------------------
I'd like to do a survey of all the optimizations which use virtual
operands to see how they use and manipulate them. From that,
we can extrapolate the kinds of questions that are asked and the kinds
of manipulations that are performed. In particular look for slight
variations in the interpretation or manipulation of the VOP data.
Until that survey is done, it's hard to predict what the interface
routines would be. There are certainly a base set such as
* can a load/store to B be inserted before/after this stmt
* can a load/store from B be moved from stmt1 to stmt2 safely
* if not, which stmt(s) are the blockers
* insert a store to B.
* etc, etc.
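A minimal sketch of what such a central query interface might look like, using a toy statement representation with may-read/may-write symbol sets (Stmt, VopOracle, and all the method names are hypothetical, for illustration only):

```python
class Stmt:
    """A statement with the memory symbols it may read (VUSEs) and
    may write (VDEFs). A toy stand-in for GIMPLE statements."""
    def __init__(self, reads=(), writes=()):
        self.reads = frozenset(reads)
        self.writes = frozenset(writes)

class VopOracle:
    """Hypothetical central interpreter of the virtual operand web.
    'stmts' is a basic block in execution order."""
    def __init__(self, stmts):
        self.stmts = stmts

    def can_move_load(self, sym, i, j):
        """Can a load of 'sym' move from stmts[i] to just before
        stmts[j]?  Returns (ok, blockers): safe iff no statement in
        between may store to sym; blockers lists the ones that do."""
        lo, hi = sorted((i, j))
        blockers = [k for k in range(lo + 1, hi)
                    if sym in self.stmts[k].writes]
        return (not blockers, blockers)

    def can_insert_store(self, sym, i):
        """Can a store to 'sym' be inserted before stmts[i] without
        clobbering a value a later statement still reads before the
        next store to 'sym'?"""
        for k in range(i, len(self.stmts)):
            if sym in self.stmts[k].reads:
                return False
            if sym in self.stmts[k].writes:
                return True
        return True
```

The point is that every pass would ask these questions through one interpreter instead of each walking and decoding the VOP web itself.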
The survey would help fill out this list and identify commonly done
tasks and needs. We do enough stuff now that it would make a pretty good
representative data set, I think.
I expect some passes hold a modified state within the pass which isn't
reflected in the VOP web. The interpreter could maybe hold an internal
modified state as well where the interface allows a pass to say 'I'm
pretending to insert this stmt, and that stmt is fake-deleted'. It
would then allow the queries to reflect the situation based on these
modifications. This state could be a state stack if need be for
unwinding changes. This would prevent optimizations from having to take
care of all this crap themselves, and simplify maintenance and future
coding.
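A minimal sketch of that pretend-state stack, on a toy statement list and ignoring actual insertion points (the class and all its method names are hypothetical):

```python
class PretendState:
    """Hypothetical layered view of the statement stream: a pass can
    pretend to insert or delete statements, run queries against the
    combined view, and unwind a whole layer of speculative changes if
    the transformation is abandoned."""
    def __init__(self, real_stmts):
        self.real = list(real_stmts)
        self.layers = []          # stack of (inserted, deleted)

    def push(self):
        """Open a new layer of speculative modifications."""
        self.layers.append(([], set()))

    def pretend_insert(self, stmt):
        # Insertion points are ignored in this sketch; a real
        # implementation would record where the stmt goes.
        self.layers[-1][0].append(stmt)

    def pretend_delete(self, stmt):
        self.layers[-1][1].add(stmt)

    def pop(self):
        """Unwind the most recent layer of speculative changes."""
        self.layers.pop()

    def view(self):
        """Statements as queries should see them: the real stream plus
        pretended inserts, minus pretended deletes, layer by layer."""
        stmts = list(self.real)
        for inserted, deleted in self.layers:
            stmts = [s for s in stmts if s not in deleted] + inserted
        return stmts
```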
That might turn out to be quite difficult; it's hard to say without the
initial survey of what and how things are done. We certainly have
enough passes using the information now that we can get a feel for what
common ground there is for interpreting the VOP web.
In thinking about a pressure reduction pass for instance, it would be
much simpler to write it if these facilities were in place as general
routines, even if it were just the query routines. I'm assuming Richi's
oracle would address some of this?
The first step would simply be to conduct the survey and see what has to
be dealt with.
SRA
------
This is a shorter subject since I know less about it at the moment, just
observations mostly.
There ought to be a simple and consistent way to look at the data in two
SRA'd elements and tell whether there is a conflict or equivalence.
Right now there appear to be issues with sub-structures such that the
offset and length are not directly comparable all the time because the
second element might be relative to a different base. Gross simplification,
but it came up when delving into a PR for 4.3. It might be as simple as
a common routine that can look at the different aspects and figure it out.
Since we have a working implementation, it should be possible to identify
the deficiencies we do have and figure out a way to treat them consistently.
SRA elements don't interact properly with the MEMSSA partitioning
either. The partitioner only sees the first element of a structure and
therefore large structures end up not triggering the partitioner. This
leaves large structures with, say, 300 elements unpartitioned, and
results in exceptionally large numbers of virtual ops, which is exactly
what MEMSSA is supposed to address in the first place. The operand code
and partitioning code currently interpret this differently, and there
are places in the operand code which don't even attempt to look at
offsets; they just assume the elements always overlap.
I see three main focuses so far:
* Create a mechanism for consistency with offsets, lengths, and bases
* Find and change the places that currently just give up, either without
trying or making half-hearted efforts
* Make SRA'd items interact with the MEMSSA partitioning better.
Additional comments and data are welcome :-)
SSA_NAME pressure reduction
-----------------------------------------
The intent here is not to do the job of the register allocator, nor is
it to spend a lot of time on a disposable optimization. I suggested this
a couple of years ago as a component of rewriting out-of-ssa and
combining it with a rewrite of expand. I suggest it here again simply
as a pass run in the middle of out-of-ssa. Once out-of-ssa has
determined what ssa_names are going to coalesce, the live range info is
somewhat representative of what will be generated during expand.
We currently have situations where a lot of optimizations happen, and we
end up providing a function to the backend which has 200 registers live
at once at some point. If RA is trying to allocate the function to an
architecture with 8 registers, it has an awful lot of work to do.
The existing register allocator does a decent job if it doesn't have to
spill. I think it probably does an acceptable job even if it has to
spill a bit. It does appear to break down badly if it has to spill a lot.
The correct long term solution is to solve the problem in the register
allocator. As we all know, this is not a trivial task. Vlad has IRA on
the horizon and I believe it is targeted for 4.4. I don't believe it
applies to all architectures yet, and I'm not aware of anything else in
the 4.4 timeline that can help.
So as a temporary solution (until a proper one presents itself), I
suggest that a pre-spiller serves the purpose. Take a function which has
way too many live ranges, and pre-spill some of the values to make the
function more amenable to our register allocator. If you are targeting 8
registers, then reduce the pressure from 200 down to 11 or 12 peak,
something more manageable. The exact number would be found during tuning.
The ideal place to do this would be right before RA. That's when all the
RTL optimizations have run, and it's close to what the register allocator
is going to see, so the data is the most accurate. If someone wants to
try that, bonus, that would be great. The new DF infrastructure may
help, but I think a lot of useful information is already gone by that
point.
It seems like a reasonably easy job to do on SSA form. The data isn't as
accurate as it would be at RA time, but you can get the general feel and
do some en masse lifetime reductions fairly cheaply and quickly. If there
are 200 SSA_NAMES live on entry to BB12, it is likely to help if we can
reduce that to a more reasonable number.
Generally speaking, when the register allocator spills, it will store a
value right after it is defined, and reload the value just before each
use. If a value can be recomputed, then you can avoid the store and simply
recompute (aka rematerialize) the value just before it is used. This
technique has the property of reducing the live range pressure over the
statements between the def and each use.
Pressure reduction general process:
* calculate the live ranges of all non-virtual ssa_names.
* calculate a spill cost for each ssa_name. This is a factor of things
like how many uses there are, whether the def is recomputable, whether
occurrences are inside loops, and over what distance they are live.
* choose ssa_names for spilling which will help pressure in hot areas,
for as long as the number of live ranges exceeds a threshold.
* rewrite the code to 'spill' these ssa_names.
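The process above could be sketched as follows, with a deliberately crude cost model (the function names and the benefit heuristic are hypothetical, for illustration only; a real pass would weigh loop depth, rematerializability, and so on):

```python
def live_ranges(defs, uses):
    """defs: name -> def point; uses: name -> list of use points
    (points are statement indices). Live range = [def, last use]."""
    return {n: (defs[n], max(uses[n])) for n in defs}

def pressure(ranges, n_points):
    """Number of live ranges spanning each statement index."""
    p = [0] * n_points
    for lo, hi in ranges.values():
        for i in range(lo, hi + 1):
            p[i] += 1
    return p

def choose_spills(defs, uses, threshold, n_points):
    """Greedily pick names to 'spill' (reload at each use) until peak
    pressure drops to the threshold. Crude benefit heuristic: prefer
    long ranges with few uses, since spilling them removes the most
    liveness for the fewest reloads."""
    ranges = live_ranges(defs, uses)
    spilled = []
    while ranges and max(pressure(ranges, n_points)) > threshold:
        def benefit(n):
            lo, hi = ranges[n]
            return (hi - lo) / (1 + len(uses[n]))
        victim = max(ranges, key=benefit)
        spilled.append(victim)
        # After spilling, only tiny ranges at each use remain; the
        # sketch simply drops the range entirely.
        del ranges[victim]
    return spilled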
There are bazillions of refinements that can be performed, but it
doesn't serve a lot of purpose to discuss them right now.
I have a previous first-take at calculating ssa_name pressure, but
nothing beyond that. (ssa-pressure-branch). The fastest approach would
be to first take that pressure code, and quickly add the remaining bits
to handle simple loads and see if there appears to be enough benefit to
continue the work.
One of the primary reasons I think there may be some benefit to this is
that non-virtual SSA_NAMESs map to registers at out-of-ssa/expand time.
Local variables are loaded into an ssa_name/register and are kept there
for their lifetime. Overlapping live ranges are given distinct
variables, so when it comes to spilling, we don't have to alias
analysis, etc. The net effect of "spilling" here is to simply keep the
local variable in memory instead of trying to keep it in a register when
we go to RTL. for instance:
a_2 = a
...
a_6 = a_2 + 5
..
b_5 = a_2 * h_3
if a_2 is selected for spilling, the resulting code is simply:
...
a_88 = a
a_6 = a_88 + 5
...
a_89 = a
b_5 = a_89 * h_3
TER already helps with register pressure a bit. When an SSA_NAME is
substituted into an expression, the register pressure between the
original load location and its use is reduced. TER will currently *only*
substitute SSA_NAMES into expressions when there is a single use of the
SSA_NAME. There are many times when an SSA_NAME is used a couple of
times, and if TER had substituted them the results would be better.
This pressure reduction mechanism can trigger TER automatically if the
pressure is high enough. In this example, a_88 and a_89 would both
substituted by TER (since they are single def/use), and the desired
result is achieved:
...
a = a + 5
...
a = a * h_3
This same mechanism will also result in TER picking up some things that
use to cross block boundaries when the situation is appropriate.
The second stage might be to look at SSA_NAMES which are expressions.
Any SSA_NAME which is calculated as an expression would simply be
stored/loaded to a new temp:
a_2 = b_6 + c_2
...
g_9 = a_2 * 2
becomes
a_2 = b_6 + c_2
tmp.9 = a_2
...
a_88 = tmp.9
g_9 = a_88 * 2
If this also shows some promise of being useful, then we can start
looking at rematerializing expressions and other such enhancements which
appear to be worthwhile and easy to do.