There are a few things that I'm considering working on for the 4.4 release, and I figured I'd see what others thought. Is anyone considering/doing similar or related work?
I'll summarize each, and then go into more details.

1 - Pass cleanup. There have been rumblings about this, but I haven't seen a lot of actual progress. We currently run hundreds of passes all the time, and we don't really know how effective or necessary many of them are. We also don't know whether the cleanups they run are truly doing much. It would be nice to do some actual analysis and do something with the results.

2 - Interface to the virtual operands. The virtual operand web provides what really amounts to low level detail of memory accesses. Every pass that cares about memory access ends up interpreting the data and dealing with the VOP web modification itself. This is in some ways analogous to every pass having to implement bsi_insert() itself. It would be nice if there was one standard central interpreter of the data that knows how to view and modify the web. Richi is working on an alias oracle of some sort, I'm not sure of its details so I don't know how far it delves into this sort of thing, or what approach it takes.

3 - SRA. There appear to be some deficiencies in SRA, and also in how it interacts with the MEMSSA partitioner. I found when looking at some 4.3 bugs that it's never as simple as looking at the base and offset, at least not consistently. Most of it appears to be first-implementation-itis: discovering new issues as you go. Stepping back and reviewing the functional requirements and its interaction with the partitioner should be useful.

4 - SSA pressure reduction. I'm throwing this back on the table. I never quite got around to it before, and nothing has changed to resolve the issues. We freely create as much register pressure in the SSA optimizers as we want (as we should be able to). The backend does nothing to address the issue, and the RTL register allocator is simply swamped by the sheer quantity of live ranges sometimes. Perhaps Vlad's RA will get in this release, and/or perhaps the need for this will be eliminated by something else, but it is something that may help code generation in the short term at least.

Those are the primary subjects, in no particular order. Now I'll delve a bit deeper into the details of what's involved for each one. If anyone is interested in any of these tasks, or has any questions/observations/criticisms, please let me know. (I know you will! :-)

I will also stick something up on the wiki in a bit with this information, and whatever other details come out from suggestions/discussions or further investigation.



Pass Cleanup
-----------------

It would sure be nice to streamline our pass pipeline. One could spend the rest of one's life doing this by hand. The two biggest problems would seem to be:
* we have no idea whether a pass actually does much
* we have no idea whether what it did was actually useful

So I was thinking that maybe we could modify passes to report what they did. When CHECKING_ENABLED is on (or something like that), every pass reports what it did, possibly to the pass manager.

Initially I was thinking it might have something like:
   * number of statements changed
   * number of statements added
   * number of statements deleted
   * number of names added
   * number of names deleted

And then a report is issued for the compilation listing every pass that ran and a summary of this data for each occurrence. Each of the TODO cleanups run by each pass should also have its data listed.
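To make the idea concrete, here's a minimal sketch (in Python, purely illustrative; none of these names exist in GCC) of how per-pass accounting like the above might be accumulated and then mined for passes that ran but did nothing:

```python
# Hypothetical sketch of per-pass statistics accumulation.  Every name
# here is invented for illustration; the real thing would live in the
# pass manager and be guarded by checking being enabled.
from collections import defaultdict

class PassStats:
    FIELDS = ("stmts_changed", "stmts_added", "stmts_deleted",
              "names_added", "names_deleted")

    def __init__(self):
        self.counts = dict.fromkeys(self.FIELDS, 0)

    def record(self, field, n=1):
        # Called by a pass each time it changes the IL.
        self.counts[field] += n

    def did_anything(self):
        return any(self.counts.values())

class PassManagerReport:
    def __init__(self):
        self.per_pass = defaultdict(PassStats)   # pass name -> stats

    def stats_for(self, pass_name):
        return self.per_pass[pass_name]

    def idle_passes(self):
        # Candidates for closer inspection: passes that ran but
        # reported no changes at all.
        return [name for name, s in self.per_pass.items()
                if not s.did_anything()]
```

Accumulating these reports over a large body of testcases would then be a simple matter of merging the per-pass counts.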

Then we could run the compiler over a whack of testcases, accumulate all these reports, and generate some data on which passes and cleanups are not doing very much and are candidates for closer inspection.

One could then turn off one or more of these passes, run again, and see whether there appeared to be much impact on the other passes.

A closer inspection may identify that perhaps the pass should be somewhere else in the pipeline, run only at -O3, completely eliminated, or modified in some way.
There are also categories of optimizations:
* optimization performers - the workhorses that actually do useful things.
* optimization enablers - introduce situations which enable a later optimizer.
* cleanups - remove crud or undo enabler work which was not profitable.

And these should probably be treated differently. Enablers tend to work in concert with optimizations and sometimes also cleanups. You need to look at the data for them together to see if useful work was done. If the cleanup is usually undoing everything the enabler did, then the enabler isn't really enabling, it's just chewing cycles :-) (or there is a flaw in the optimization; in any case, it all deserves a closer look if the group isn't accomplishing much)

It's also possible that a pass should only be run on a specific architecture or set of arches. I see no reason why we shouldn't allow the pass pipeline to be tuned for specific architectures. Not everything that is good for a 32 register machine is good for one with 8 registers, and vice versa.

This could then be further extended into the RTL passes, and there are some other extensions that could be useful. It would be nice for instance if we could statically guess at whether the runtime was affected or not, and by how much. Many modifications that optimizations make aren't really going to be measurable by simply testing execution speed.

* The scheduler could perhaps spit out a summary of what it thinks the number of cycles through the execution of the predicted path/key/all blocks would be.
* The loop optimizers could submit summary info about loops, and tied in with the scheduler info, we could guess at whether an optimization affected the cycles in a loop by generating reports with and without the pass and comparing the cycle estimates of the core loops and main path.
* We can also compare code size based on the object code produced, and could work on the -Os pass pipeline as well.
* These reports would be useful for some of the automated pass shuffling experiments as well, I would think.

And so on. Once properly set up, you could actually automate quite a bit of this and maybe get some very interesting data.

I think it would also be a good idea to set up a generic logging mechanism for this. There are other tasks within the compiler that a generic logging mechanism would be useful for. I've seen requests for optimizers to generate reports on what they did or didn't do and why, providing hints to the programmer about how to change their code to get better optimizations. I think we even had/have a branch for this sort of thing. It seems like a good opportunity to get something generic in place for future use.
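A minimal sketch of the kind of generic logging mechanism I mean (names and API entirely invented, just to show the shape of it):

```python
# Toy sketch of a generic optimization-report log.  A pass records why
# it did or didn't transform something; a report tool slices the log
# per pass afterwards.  All names here are hypothetical.
class OptReportLog:
    def __init__(self):
        self.entries = []

    def note(self, pass_name, location, message):
        """Record one report entry, e.g. why an optimization didn't fire."""
        self.entries.append((pass_name, location, message))

    def for_pass(self, pass_name):
        return [e for e in self.entries if e[0] == pass_name]
```

The same channel could carry both the pass-effectiveness statistics above and the programmer-facing "here's why your loop wasn't optimized" hints.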




Interface to Virtual Operands
-------------------------------------

I'd like to do a survey of all the optimizations which use virtual operands to see how they use them and how they manipulate them. From that, we can extrapolate the kinds of questions that are asked and the kinds of manipulations that are performed. In particular, look for slight variations in the interpretation or manipulation of the VOP data.

Until that survey is done, it's hard to predict what the interface routines would be. There are certainly a base set, such as:
* can a load/store to B be inserted before/after this stmt
* can a load/store from B be moved from stmt1 to stmt2 safely
* if not, which stmt(s) are the blockers
* insert a store to B
* etc., etc.
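To give the flavor of what a central interpreter might answer, here is a toy model (purely illustrative; nothing like the real VOP data structures, and every name is invented):

```python
# Toy model of a central VOP-web query interface.  Statements carry
# read/write sets over abstract memory objects; the "oracle" answers
# motion queries instead of each pass re-deriving them.
class Stmt:
    def __init__(self, name, reads=(), writes=()):
        self.name = name
        self.reads = set(reads)
        self.writes = set(writes)

class VopOracle:
    def __init__(self, stmts):
        self.stmts = list(stmts)   # statements in execution order

    def blockers_for_load_motion(self, var, s1, s2):
        """Statements strictly between s1 and s2 that store to VAR,
        blocking motion of a load of VAR from s1 down to s2."""
        i, j = self.stmts.index(s1), self.stmts.index(s2)
        return [s for s in self.stmts[i + 1:j] if var in s.writes]

    def can_move_load(self, var, s1, s2):
        return not self.blockers_for_load_motion(var, s1, s2)
```

The real interface would of course have to deal with aliasing, partitions, and basic-block boundaries, but the point is that a pass asks questions instead of walking the web itself.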

The survey would help fill out this list and identify commonly done tasks and needs. We do enough stuff now that it would make a pretty good representational data set I think.

I expect some passes hold a modified state within the pass which isn't reflected in the VOP web. The interpreter could maybe hold an internal modified state as well, where the interface allows a pass to say 'I'm pretending to insert this stmt, and that stmt is fake-deleted'. It would then allow the queries to reflect the situation based on these modifications. This state could be a state stack if need be, for unwinding changes. This would prevent optimizations from having to take care of all this crap themselves, and simplify maintenance and future coding.

That might turn out to be quite difficult; it's hard to say without the initial survey of what and how things are done. We certainly have enough passes using the information now that we can get a feel for what common ground there is for interpreting the VOP web.

In thinking about a pressure reduction pass for instance, it would be much simpler to write it if these facilities were in place as general routines, even if it were just the query routines. I'm assuming Richi's oracle would address some of this?

The first step would simply be to conduct the survey and see what has to be dealt with.




SRA
------

This is a shorter subject since I know less about it at the moment, just observations mostly. There ought to be a simple and consistent way to look at the data in two SRA'd elements and tell whether there is a conflict or equivalency. Right now there appear to be issues with sub-structures such that the offset and length are not directly comparable all the time, because the second element might be off of a different base. A gross simplification, but it came up when delving into a PR for 4.3. It might be as simple as a common routine that can look at the different aspects and figure it out.
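In the very simplest case, such a common routine might look like this sketch (illustrative only; real SRA elements carry much more state, and resolving two bases to a common one is the hard part being glossed over):

```python
# Sketch of a consistent conflict check for two accessed byte ranges.
# Bases here are just opaque tags; a real routine would first try to
# resolve both accesses to a common base before comparing offsets.
def accesses_conflict(base1, off1, len1, base2, off2, len2):
    if base1 != base2:
        # Different bases: without resolving them to a common base we
        # cannot compare offsets directly, so be conservative.
        return True
    # Same base: conflict iff the byte intervals [off, off+len) overlap.
    return off1 < off2 + len2 and off2 < off1 + len1
```

Having every client funnel through one routine like this, rather than comparing offsets ad hoc, is the consistency point I'm after.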

Since we have a working implementation, it should be possible to identify the deficiencies we do have and figure out a way to treat them consistently.

SRA elements don't interact properly with the MEMSSA partitioning either. The partitioner only sees the first element of a structure, and therefore large structures end up not triggering the partitioner. This leaves large structures with, say, 300 elements unpartitioned, and results in exceptionally large numbers of virtual ops, which is exactly what MEMSSA is supposed to address in the first place. The operand code and partitioning code currently interpret this differently, and there are places in the operand code which don't even attempt to look at the offset, and just assume accesses always overlap.

I see three main focuses so far:
* Create a mechanism for consistency with offsets, lengths, and bases
* Find and change the places that currently just give up, either without trying or making half-hearted efforts
* Make SRA'd items interact with the MEMSSA partitioning better.

Additional comments and data are welcome :-)



SSA_NAME pressure reduction
-----------------------------------------

The intent here is not to do the job of the register allocator, nor is it to spend a lot of time on a disposable optimization. I suggested this a couple of years ago as a component of rewriting out-of-ssa and combining it with a rewrite of expand. I suggest it here again simply as a pass run in the middle of out-of-ssa. Once out-of-ssa has determined what ssa_names are going to coalesce, the live range info is somewhat representative of what will be generated during expand.

We currently have situations where a lot of optimizations happen, and we end up providing a function to the backend which has 200 registers live at once at some point. If RA is trying to allocate the function to an architecture with 8 registers, it has an awful lot of work to do.

The existing register allocator does a decent job if it doesn't have to spill. I think it probably does an acceptable job even if it has to spill a bit. It does appear to break down badly if it has to spill a lot.

The correct long term solution is to solve the problem in the register allocator. As we all know, this is not a trivial task. Vlad has IRA on the horizon and I believe it is targeted for 4.4. I don't believe it applies to all architectures yet, and I'm not aware of anything else in the 4.4 timeline that can help.

So as a temporary solution (until a proper one presents itself), I suggest that a pre-spiller serves the purpose. Take a function which has way too many live ranges, and pre-spill some of the values to make the function more amenable to our register allocator. If you are targeting 8 registers, then reduce the pressure from 200 down to 11 or 12 peak, something more manageable. The exact number would be found during tuning. The ideal place to do this would be right before RA. That's when all the RTL optimizations have run, and it's close to what the register allocator is going to see, so the data is the most accurate. If someone wants to try that, bonus, that would be great. The new DF infrastructure may help, but I think a lot of useful information is already gone by that point.

It seems like a reasonably easy job to do on SSA form. The data isn't as accurate as it would be at RA time, but you can get the general feel and do some en-mass lifetime reductions fairly cheaply and quickly. If there are 200 SSA_NAMES live on entry to BB12, it is likely to help if we can reduce that to a more reasonable number.

Generally speaking, when the register allocator spills, it will store a value right after it is defined, and reload the value just before each use. If a value can be recomputed, then you can avoid the store and simply recompute (aka rematerialize) the value just before it is used. This technique has the property of reducing the live range pressure over the statements between the def and each use.

Pressure reduction general process:
* calculate the live ranges of all non-virtual ssa_names.
* calculate a spill cost for each ssa_name. This is a factor of things like how many uses there are, whether the def is recomputable, whether occurrences are inside loops, and over what distance they are live.
* choose ssa_names for spilling which will help pressure in hot areas, for as long as the number of live ranges exceeds a threshold.
* rewrite the code to 'spill' these ssa_names.
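As a rough illustration of the selection step, here is a toy sketch (all names invented; real live data and costs would come from the pass itself, and "program points" stand in for statements/blocks):

```python
# Toy sketch of spill-candidate selection.  live_at maps each program
# point to the set of ssa_names live there; cost maps each name to a
# (hypothetical) spill cost.  Wherever pressure exceeds the threshold,
# spill the cheapest names first until it no longer does.
def choose_spill_candidates(live_at, cost, threshold):
    spilled = set()
    for point in sorted(live_at):            # walk program points in order
        live = live_at[point] - spilled      # already-spilled names are gone
        excess = len(live) - threshold
        if excess > 0:
            # Cheapest-to-spill names go first.
            for name in sorted(live, key=lambda n: cost[n])[:excess]:
                spilled.add(name)
    return spilled
```

A real cost function would fold in loop depth, number of uses, and rematerializability, and hot areas would be weighted by profile data, but the overall shape is this simple.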


There are bazillions of refinements that can be performed, but it doesn't serve a lot of purpose to discuss them right now.

I have a previous first-take at calculating ssa_name pressure, but nothing beyond that. (ssa-pressure-branch). The fastest approach would be to first take that pressure code, and quickly add the remaining bits to handle simple loads and see if there appears to be enough benefit to continue the work.

One of the primary reasons I think there may be some benefit to this is that non-virtual SSA_NAMEs map to registers at out-of-ssa/expand time. Local variables are loaded into an ssa_name/register and are kept there for their lifetime. Overlapping live ranges are given distinct variables, so when it comes to spilling, we don't have to do alias analysis, etc. The net effect of "spilling" here is to simply keep the local variable in memory instead of trying to keep it in a register when we go to RTL. For instance:

a_2 = a
...
a_6 = a_2 + 5
...
b_5 = a_2 * h_3

if a_2 is selected for spilling, the resulting code is simply:

...
a_88 = a
a_6 = a_88 + 5
...
a_89 = a
b_5 = a_89 * h_3

TER already helps with register pressure a bit. When an SSA_NAME is substituted into an expression, the register pressure between the original load location and its use is reduced. TER will currently *only* substitute SSA_NAMEs into expressions when there is a single use of the SSA_NAME. There are many times when an SSA_NAME is used a couple of times, and if TER had substituted it the results would be better.

This pressure reduction mechanism can trigger TER automatically if the pressure is high enough. In this example, a_88 and a_89 would both be substituted by TER (since they are single def/use), and the desired result is achieved:
...
a = a + 5
...
a = a * h_3

This same mechanism will also result in TER picking up some things that used to cross block boundaries, when the situation is appropriate.


The second stage might be to look at SSA_NAMES which are expressions. Any SSA_NAME which is calculated as an expression would simply be stored/loaded to a new temp:

a_2 = b_6 + c_2
...
g_9 = a_2 * 2

becomes

a_2 = b_6 + c_2
tmp.9 = a_2
...
a_88 = tmp.9
g_9 = a_88 * 2


If this also shows some promise of being useful, then we can start looking at rematerializing expressions and other such enhancements which appear to be worthwhile and easy to do.
