There are a few things that I'm considering working on for the 4.4
release, and I figured I'd see what others thought. Is anyone
considering/doing similar or related work?
I'll summarize each, and then go into more details.
1 - Pass cleanup. There have been rumblings about this, but I haven't
seen a lot of actual progress. We currently run hundreds of passes all the
time, and we don't really know how effective or necessary many of them
are. We also don't know whether the cleanups they run are truly doing
much. It would be nice to do some actual analysis and do something with
the results.
2 - Interface to the virtual operands. The virtual operand web provides
what really amounts to low level detail of memory accesses. Every pass
that cares about memory access ends up interpreting the data and dealing
with the VOP web modification itself. This is in some ways analogous to
every pass having to implement bsi_insert() itself. It would be nice if
there was one standard central interpreter of the data that knows how to
view and modify the web. Richi is working on an alias oracle of some
sort; I'm not sure of its details, so I don't know how far it delves into
this sort of thing, or what approach it takes.
3 - SRA. There appear to be some deficiencies in SRA, and also how it
interacts with the MEMSSA partitioner. I found when looking at some 4.3
bugs that it's never as simple as looking at the base and offset, at
least not consistently. Most of it appears to be
first-implementation-itis: the discovery of new issues as you go.
Stepping back and reviewing the functional requirements and its
interaction with the partitioner should be useful.
4 - SSA pressure reduction. I'm throwing this back on the table. I never
quite got around to it before, and nothing has changed to resolve the
issues. We freely create as much register pressure in the SSA optimizers
as we want (as we should be able to). The backend does nothing to
address the issue and the RTL register allocator is simply swamped by
the sheer quantity of live ranges sometimes. Perhaps Vlad's RA will get
in this release, and/or perhaps the need for this will be eliminated by
something else, but it is something that may help code generation in the
short term at least.
Those are the primary subjects, in no particular order. Now I'll delve a
bit deeper into the details of what's involved for each one.
If anyone is interested in any of these tasks, or has any
questions/observations/criticisms, please let me know. (I know you will!
:-)
I will also stick something up on the wiki in a bit with this
information, and whatever other details come out from
suggestions/discussions or further investigation.
Pass Cleanup
-----------------
It would sure be nice to streamline our pass pipeline. One could spend
the rest of one's life doing this by hand. The two biggest problems would
seem to be:
* we have no idea whether a pass actually does much
* and we have no idea whether it was actually useful
So I was thinking that maybe we could modify passes to report what they
did. When CHECKING_ENABLED is on (or something like that), every pass
reports what it did, possibly to the pass manager.
Initially I was thinking it might have something like:
* number of statements changed
* number of statements added
* number of statements deleted
* number of names added
* number of names deleted
And then a report is issued for the compilation listing every pass that
ran and a summary of this data for each occurrence. Each of the TODO
cleanups run by each pass should also have its data listed.
Then we could run the compiler over a whack of testcases, accumulate all
these reports, and generate some data on which passes and cleanups are
not doing very much and are candidates for closer inspection.
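As a rough sketch of what such per-pass reporting and accumulation could look like (all the names here, PassStats, summarize, and the counter fields, are hypothetical illustrations, not any existing pass manager API):

```python
from collections import defaultdict

class PassStats:
    """Counters a pass would report when checking is enabled.
    Hypothetical: not the actual GCC pass manager interface."""
    FIELDS = ("stmts_changed", "stmts_added", "stmts_deleted",
              "names_added", "names_deleted")

    def __init__(self):
        self.counts = {f: 0 for f in self.FIELDS}

    def record(self, field, n=1):
        self.counts[field] += n

    def did_anything(self):
        return any(self.counts.values())

def summarize(reports):
    """Accumulate (pass_name, PassStats) reports from many compilations
    and compute, per pass, the total counters and the fraction of runs
    in which the pass changed nothing at all."""
    totals = defaultdict(lambda: {f: 0 for f in PassStats.FIELDS})
    runs = defaultdict(int)
    idle = defaultdict(int)
    for pass_name, stats in reports:
        runs[pass_name] += 1
        if not stats.did_anything():
            idle[pass_name] += 1
        for f, v in stats.counts.items():
            totals[pass_name][f] += v
    return {"totals": dict(totals),
            "idle_fraction": {p: idle[p] / runs[p] for p in runs}}
```

A pass with a high idle fraction across a big testcase corpus would be the kind of candidate for closer inspection described above.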
One could then turn off one or more of these passes, run again, and see
whether there appeared to be much impact on the other passes.
A closer inspection may identify that perhaps the pass should be
somewhere else in the pipeline, run only at -O3, completely eliminated,
or modified in some way.
There are also categories of optimizations:
* optimization performers - The workhorses that actually do useful things.
* optimization enablers - Introduce situations which will enable a
later optimizer.
* cleanups - Those which remove crud or undo enabler work which was
not profitable.
And these should probably be treated differently. Enablers tend to work
in concert with optimizations and sometimes also cleanups. You need to
look at the data for them together to see if useful work was done. If
the cleanup is usually undoing everything the enabler did, then the
enabler isn't really enabling, it's just chewing cycles :-) (or there is
a flaw in the optimization; in any case, it all deserves a closer look
if the group isn't accomplishing much)
It is also possible that a pass should only be run on a specific
architecture or set of arches. I see no reason why we shouldn't allow
the pass pipeline to be tuned for specific architectures. Not
everything that is good for a 32 register machine is good for one with 8
registers and vice versa.
This could then be further extended into the RTL passes, and there are
some other extensions that could be useful. It would be nice for
instance if we could statically guess at whether the runtime was
affected or not, and by how much. Many modifications that optimizations
make aren't really going to be measurable by simply testing execution speed.
* The scheduler could perhaps spit out a summary of its estimated
cycle count through the predicted path, the key blocks, or all
blocks.
* The loop optimizers could submit summary info about loops, and tied
in with the scheduler info, we could guess at whether an optimization
affected the cycles in a loop by generating reports with and without the
pass and comparing the cycles estimate of the core loops and main path.
* We can also compare code size based on the object code produced, and
could work on the -Os pass pipeline as well.
* These reports would be useful for some of the automated pass
shuffling experiments as well, I would think.
And so on. Once properly set up, you could actually automate quite a bit
of this and maybe get some very interesting data.
I think it would also be a good idea to set up a generic logging
mechanism for this. There are other tasks within the compiler that a
generic logging mechanism would be useful for. I've seen requests for
optimizers to generate reports on what they did or didn't do and why,
providing hints to the programmer about how to change their code to get
better optimizations. I think we even had/have a branch for this sort
of thing. It seems like a good opportunity to get something generic in
place for future use.
Interface to Virtual Operands
-------------------------------------
I'd like to do a survey of all the optimizations which use virtual
operands to see how they use and manipulate them. From that,
we can extrapolate the kinds of questions that are asked and the kinds
of manipulations that are performed. In particular look for slight
variations in the interpretation or manipulation of the VOP data.
Until that survey is done, it's hard to predict what the interface
routines would be. There are certainly a base set such as
* can a load/store to B be inserted before/after this stmt
* can a load/store from B be moved from stmt1 to stmt2 safely
* if not, which stmt(s) are the blockers
* insert a store to B.
* etc, etc.
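A minimal sketch of what such a central query interface might look like, using a toy statement representation with may-read/may-write symbol sets (Stmt, VopOracle, and all the method names are hypothetical, for illustration only):

```python
class Stmt:
    """A statement with the memory symbols it may read (VUSEs) and
    may write (VDEFs). A toy stand-in for GIMPLE statements."""
    def __init__(self, reads=(), writes=()):
        self.reads = frozenset(reads)
        self.writes = frozenset(writes)

class VopOracle:
    """Hypothetical central interpreter of the virtual operand web.
    'stmts' is a basic block in execution order."""
    def __init__(self, stmts):
        self.stmts = stmts

    def can_move_load(self, sym, i, j):
        """Can a load of 'sym' move from stmts[i] to just before
        stmts[j]?  Returns (ok, blockers): safe iff no statement in
        between may store to sym; blockers lists the ones that do."""
        lo, hi = sorted((i, j))
        blockers = [k for k in range(lo + 1, hi)
                    if sym in self.stmts[k].writes]
        return (not blockers, blockers)

    def can_insert_store(self, sym, i):
        """Can a store to 'sym' be inserted before stmts[i] without
        clobbering a value a later statement still reads before the
        next store to 'sym'?"""
        for k in range(i, len(self.stmts)):
            if sym in self.stmts[k].reads:
                return False
            if sym in self.stmts[k].writes:
                return True
        return True
```

The point is that every pass would ask these questions through one interpreter instead of each walking and decoding the VOP web itself.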
The survey would help fill out this list and identify commonly done
tasks and needs. We do enough stuff now that it would make a pretty good
representative data set, I think.
I expect some passes hold a modified state within the pass which isn't
reflected in the VOP web. The interpreter could maybe hold an internal
modified state as well where the interface allows a pass to say 'I'm
pretending to insert this stmt, and that stmt is fake-deleted'. It
would then allow the queries to reflect the situation based on these
modifications. This state could be a state stack if need be for
unwinding changes. This would prevent optimizations from having to take
care of all this crap themselves, and simplify maintenance and future
coding.
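A minimal sketch of that pretend-state stack, on a toy statement list and ignoring actual insertion points (the class and all its method names are hypothetical):

```python
class PretendState:
    """Hypothetical layered view of the statement stream: a pass can
    pretend to insert or delete statements, run queries against the
    combined view, and unwind a whole layer of speculative changes if
    the transformation is abandoned."""
    def __init__(self, real_stmts):
        self.real = list(real_stmts)
        self.layers = []          # stack of (inserted, deleted)

    def push(self):
        """Open a new layer of speculative modifications."""
        self.layers.append(([], set()))

    def pretend_insert(self, stmt):
        # Insertion points are ignored in this sketch; a real
        # implementation would record where the stmt goes.
        self.layers[-1][0].append(stmt)

    def pretend_delete(self, stmt):
        self.layers[-1][1].add(stmt)

    def pop(self):
        """Unwind the most recent layer of speculative changes."""
        self.layers.pop()

    def view(self):
        """Statements as queries should see them: the real stream plus
        pretended inserts, minus pretended deletes, layer by layer."""
        stmts = list(self.real)
        for inserted, deleted in self.layers:
            stmts = [s for s in stmts if s not in deleted] + inserted
        return stmts
```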
That might turn out to be quite difficult; it's hard to say without the
initial survey of what and how things are done. We certainly have
enough passes using the information now that we can get a feel for what
common ground there is for interpreting the VOP web.
In thinking about a pressure reduction pass for instance, it would be
much simpler to write it if these facilities were in place as general
routines, even if it were just the query routines. I'm assuming Richi's
oracle would address some of this?
The first step would simply be to conduct the survey and see what has to
be dealt with.
SRA
------
This is a shorter subject since I know less about it at the moment, just
observations mostly.
There ought to be a simple and consistent way to look at the data in two
SRA'd elements and tell whether there is a conflict or equivalence.
Right now there appear to be issues with sub-structures such that the
offset and length are not directly comparable all the time because the
second element might be relative to a different base. Gross simplification,
but it came up when delving into a PR for 4.3. It might be as simple as
a common routine that can look at the different aspects and figure it out.
Since we have a working implementation, it should be possible to identify
the deficiencies we do have and figure out a way to treat them consistently.
SRA elements don't interact properly with the MEMSSA partitioning
either. The partitioner only sees the first element of a structure and
therefore large structures end up not triggering the partitioner. This
leaves large structures with, say, 300 elements unpartitioned, and
results in exceptionally large numbers of virtual ops, which is exactly
what MEMSSA is supposed to address in the first place. The operand code
and partitioning code currently interpret this differently, and there
are places in the operand code which don't even attempt to look at
offsets; they just assume the elements always overlap.
I see three main focuses so far:
* Create a mechanism for consistency with offsets, lengths, and bases
* Find and change the places that currently just give up, either without
trying or making half-hearted efforts
* Make SRA'd items interact with the MEMSSA partitioning better.
Additional comments and data are welcome :-)
SSA_NAME pressure reduction
-----------------------------------------
The intent here is not to do the job of the register allocator, nor is
it to spend a lot of time on a disposable optimization. I suggested this
a couple of years ago as a component of rewriting out-of-ssa and
combining it with a rewrite of expand. I suggest it here again simply
as a pass run in the middle of out-of-ssa. Once out-of-ssa has
determined what ssa_names are going to coalesce, the live range info is
somewhat representative of what will be generated during expand.
We currently have situations where a lot of optimizations happen, and we
end up providing a function to the backend which has 200 registers live
at once at some point. If RA is trying to allocate the function to an
architecture with 8 registers, it has an awful lot of work to do.
The existing register allocator does a decent job if it doesn't have to
spill. I think it probably does an acceptable job even if it has to
spill a bit. It does appear to break down badly if it has to spill a lot.
The correct long term solution is to solve the problem in the register
allocator. As we all know, this is not a trivial task. Vlad has IRA on
the horizon and I believe it is targeted for 4.4. I don't believe it
applies to all architectures yet, and I'm not aware of anything else in
the 4.4 timeline that can help.
So as a temporary solution (until a proper one presents itself), I
suggest that a pre-spiller serves the purpose. Take a function which has
way too many live ranges, and pre-spill some of the values to make the
function more amenable to our register allocator. If you are targeting 8
registers, then reduce the pressure from 200 down to 11 or 12 peak,
something more manageable. The exact number would be found during tuning.
The ideal place to do this would be right before RA. That's when all the
RTL optimizations have run, and it's close to what the register allocator
is going to see, so the data is the most accurate. If someone wants to
try that, bonus, that would be great. The new DF infrastructure may
help, but I think a lot of useful information is already gone by that
point.
It seems like a reasonably easy job to do on SSA form. The data isn't as
accurate as it would be at RA time, but you can get the general feel and
do some en masse lifetime reductions fairly cheaply and quickly. If there
are 200 SSA_NAMES live on entry to BB12, it is likely to help if we can
reduce that to a more reasonable number.
Generally speaking, when the register allocator spills, it will store a
value right after it is defined, and reload the value just before each
use. If a value can be recomputed, then you can avoid the store and simply
recompute (aka rematerialize) the value just before it is used. This
technique has the property of reducing the live range pressure over the
statements between the def and each use.
Pressure reduction general process:
* calculate the live ranges of all non-virtual ssa_names.
* calculate a spill cost for each ssa_name. This is a factor of things
like how many uses there are, whether the def is recomputable, whether
occurrences are inside loops, and over what distance they are live.
* choose ssa_names for spilling which will help pressure in hot areas,
for as long as the number of live ranges exceeds a threshold.
* rewrite the code to 'spill' these ssa_names.
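The process above could be sketched as follows, with a deliberately crude cost model (the function names and the benefit heuristic are hypothetical, for illustration only; a real pass would weigh loop depth, rematerializability, and so on):

```python
def live_ranges(defs, uses):
    """defs: name -> def point; uses: name -> list of use points
    (points are statement indices). Live range = [def, last use]."""
    return {n: (defs[n], max(uses[n])) for n in defs}

def pressure(ranges, n_points):
    """Number of live ranges spanning each statement index."""
    p = [0] * n_points
    for lo, hi in ranges.values():
        for i in range(lo, hi + 1):
            p[i] += 1
    return p

def choose_spills(defs, uses, threshold, n_points):
    """Greedily pick names to 'spill' (reload at each use) until peak
    pressure drops to the threshold. Crude benefit heuristic: prefer
    long ranges with few uses, since spilling them removes the most
    liveness for the fewest reloads."""
    ranges = live_ranges(defs, uses)
    spilled = []
    while ranges and max(pressure(ranges, n_points)) > threshold:
        def benefit(n):
            lo, hi = ranges[n]
            return (hi - lo) / (1 + len(uses[n]))
        victim = max(ranges, key=benefit)
        spilled.append(victim)
        # After spilling, only tiny ranges at each use remain; the
        # sketch simply drops the range entirely.
        del ranges[victim]
    return spilled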
There are bazillions of refinements that can be performed, but it
doesn't serve a lot of purpose to discuss them right now.
I have a previous first-take at calculating ssa_name pressure, but
nothing beyond that. (ssa-pressure-branch). The fastest approach would
be to first take that pressure code, and quickly add the remaining bits
to handle simple loads and see if there appears to be enough benefit to
continue the work.
One of the primary reasons I think there may be some benefit to this is
that non-virtual SSA_NAMESs map to registers at out-of-ssa/expand time.
Local variables are loaded into an ssa_name/register and are kept there
for their lifetime. Overlapping live ranges are given distinct
variables, so when it comes to spilling, we don't have to alias
analysis, etc. The net effect of "spilling" here is to simply keep the
local variable in memory instead of trying to keep it in a register when
we go to RTL. for instance:
a_2 = a
...
a_6 = a_2 + 5
..
b_5 = a_2 * h_3
if a_2 is selected for spilling, the resulting code is simply:
...
a_88 = a
a_6 = a_88 + 5
...
a_89 = a
b_5 = a_89 * h_3
TER already helps with register pressure a bit. When an SSA_NAME is
substituted into an expression, the register pressure between the
original load location and its use is reduced. TER will currently *only*
substitute SSA_NAMES into expressions when there is a single use of the
SSA_NAME. There are many times when an SSA_NAME is used a couple of
times, and if TER had substituted them the results would be better.
This pressure reduction mechanism can trigger TER automatically if the
pressure is high enough. In this example, a_88 and a_89 would both
substituted by TER (since they are single def/use), and the desired
result is achieved:
...
a = a + 5
...
a = a * h_3
This same mechanism will also result in TER picking up some things that
use to cross block boundaries when the situation is appropriate.
The second stage might be to look at SSA_NAMES which are expressions.
Any SSA_NAME which is calculated as an expression would simply be
stored/loaded to a new temp:
a_2 = b_6 + c_2
...
g_9 = a_2 * 2
becomes
a_2 = b_6 + c_2
tmp.9 = a_2
...
a_88 = tmp.9
g_9 = a_88 * 2
If this also shows some promise of being useful, then we can start
looking at rematerializing expressions and other such enhancements which
appear to be worthwhile and easy to do.