Hi,
while looking into problems of current and pretty-ipa's inlining heuristics
implementation, I noticed that we have relatively important problems with EH
overhead that confuse inliner code size metrics.
Looking deeper into EH problems, I think main issues are:
- Inliner tends to produce exponential amount of destructor calls on programs
with high abstraction penalty (pooma/boost/DLV all seem to have this).
Every scope block such as { class a a;.....} contains implicit cleanup.
Destructor of class A is called twice, once for ned of block, once for
cleanup.
Now it is common for destructors to call destuctors of contained objects
and local iterators, this all happens in same way causing EH cleanup code
that often exceed main function body.
- Not inlining destructors in cleanups cause otherwise fully inlined object
to miss SRA/store sinking and other optimizations.
- We tend to block optimization because variable along hot path has abnormal
PHI. Interestingly this usually limits to creating more copies, PHIs,
conditionals and BBs that gets almost completely cleaned up later on RTL
land, but I don't think it is good excuse to ignore these ;)
- Cleanups calling destructors are often fully removable but we
don't do that very well. We remove completely dead destructor by means
of new local-pureconst pass. As soon as destructor has some stores
or loops, we realize the fact that they are dead very late in compilation.
Also destructors calling delete() on some fields are not simplified. This
is partly doable via Martin's IPA-SRA.
I thus started to look into improvements in EH machinery and elimination
of code in the cleanups
What is already done on mainline:
- New ehcleanup pass: when EH handler becomes empty by
local pureconst+DCE+DSE+complette unrolling combo, we can safely eliminate
it. This further eliminate local nothrow region handlers that are
numberous.
- There is new local pureconst pass done during early optimization.
It tries to prove stuff nothrow and pure. This leads to DCEing the
destructors before they are accounted in inline metricts.
- Some of bits from RTL EH lowering are gone now. More should come
We really have here partial transition from RTL EH code to gimple.
This will hopefully simplify RTL EH code to basic part adding
landing pads.
What is already done on pretty-ipa:
- Inliner herusitics can be bit smarter about code size estimates on
how code will look after inlining. In particular MUST_NOT_THROW
receivers are most likely
optimized away or commonized and can be ignored. Also all the
FILTER_EXPR/OBJ_REF_EXPR
sets/reads are probably going to disappear after lowering.
What I have implementation for and would like to push to pretty-ipa and
later for mainline after some more evaulation:
- EH edge redirection. This is conceptually easy thing to do. When EH edge
needs to be redirected and it is only edge to BB, we just update label in
tree of EH handlers.
When it is not only edge, we walk from the throwing statement region up to
the outermost handler region in the EH tree and duplicate everything on a
way. This breaks some assumptions EH code have. In particular on EH
handler can do RESX in other handler and handlers can share labels.
These assumptions are made in except.c, but they don't have to be: EH tree
is really just on-side representaiton of decision tree of what happens after
expression and is lowered to such for for dwarf. There is no need to know
what happens after EH is delivered.
While I don't expect immediate improvements in C++ codegen.
I benchmarked it in C++ testers and there is some improvment in
libstdc++ tests, little improvement in boot's wave and cwcessboard. Partly
it is because all testcases we have do pretty much no EH except for one
done implicitly by cleanup regions. The testcases improving in libstdc++
do have try....catch constructs in them. We probably could try to find
some new benchmarks for C++ testsuite to cover this area as well as cases
where it is not best solution to inline everything completely. Probably
something like mozilla would do the job, but I don't know if there is easy
to compile benchmark aplication for this. Our C++ testsuite is pretty much
testsuite for specific kind of C++ coding requiring inliner to do a lot
of job. This is not surprising because it was constructed with inliner
tunning in mind.
I think this is important infrastructure bit. In particular it seems to
me that since this leaves us with abnormal edges produced by
setjmp/longjmp, nonlocal labels and computed goto and because of computed
goto factoring, only longjmp/setjmp edges are really unsplittable and thus
we can get rid of abnormal phi handling and teach out-of-ssa to insert
conditionals into abnormal edge receiving BB to resolve the unlikely case
we will make copy on abnormal edge neccesary.
Redirecting EH edges also allows sinking code needed only on EH path down
to the edges that is quite common because of inlining differences above.
- EH region merging: after critical edge splitting we produce a lot of copies
of EH regions by EH edge redirecition. It is simple to merge them again
in ehcleanup by hashing them and proving equivalence.
- It is possible to DCE FILTER_EXPR (and OBJ_REF_EXPR). I use simple code
that try to verify that value stored to FILTER_EXPR is read in same BB (and
thus provably don't change value as described bellow) and do not set the
magic
needed bit on it in that case. This kills most of FILTER_EXPR sets.
Some more or less experimental stuff I have:
- We should finish the transition and move as much as possible of EH lowering
up to gimple world. Producing post-landing pads in gimple is easy (they
are really just conditionals on FILTER_EXPR/OBJ_REF_EXPR) and lowering
RESX to goto is easy too.
This allows scalar optimizations on the produced code.
There is problem that after lowering it is no longer easy to do dead
cleanup elimination. So we probably should do this transform somewhere
in half of our optimization queue.
I also experimented with lowering only the trivial cases pre-inlining.
Some remaining issues:
- FILTER_EXPR/OBJ_REF_EXPR is currently handled in quite dangerous way.
Original Rth's code made them quite 100% volatile. Now we can PRE them.
The FILTER_EXPR/OBJ_REF_EXPR are really hard registers in RTL world that
are set by EH runtime on EH edges from throwing statement to handler
(not at RESX edges) and they are used by pre-landing pads and also by
RESX code.
It would be more precise if RESX instruction took FILTER_EXPR/OBJ_REF_EXPR
value as argument (since it really uses the values) so we can kill magicness
of the sets. Problem is that I don't think there is good SSA representation
for register that implicitly change over edges.
Take the example
call (); EH edge to handler 1:
....
receiver 1:
tmp1 = filter_expr;
tmp2 = obj_ref_expr;
call2 (); EH edge to handler 2
label1:
filter_expr = tmp1
obj_ref_expr = tmp2
resx (tmp1, tmp2)
handler 2:
tmp3 = filter_expr;
tmp4 = obj_ref_expr;
if (conditional)
goto label1:
else
filter_expr = tmp3
obj_ref_expr = tmp4
resx (tmp3, tmp4);
In this case tmp1 != tmp3 and tmp2 != tmp4 and thus it is invalid to
optimize
second resx to (tmp1, tmp3). There is nothing to model this.
I wonder if FILTER_EXPR/OBJ_REF_EXPR can't be best handled by just being
volatile to majority of optimizations (i.e. assumed to change value all the
time)
and handle this in copyprop/PRE as specal case invalidating the value across
EH edges?
- The nature of code duplication in between cleanup at end of block and
cleanup in EH actually brings a lot of tail merging possibilities.
I wonder if we can handle this somehow effectivly on SSA. In RTL world
crossjumping would catch thse cases if it was not almost 100% ineffective
by fact that we hardly re-use same register for temporaries.
I wonder if there is resonable SSA optimization that would have similar
effect as tail merging here or if we want to implement tail merging on
gimple.
- Can we somehow conclude that structure being desturcted dies after the
destructors so all writes to it are dead already in early optimizations?
That would allow a lot more DSE and cheaper inlining.
- It is possible to make EH edges redirectable on RTL too. I wonder
if it is worth the effort however.
- We ought to be able to prove finitarity of simple loops so we can
DCE more early and won't rely on full loop unrolling to get rid of
empty loops originally initializing dead arrays.
Honza