We have done extensive performance analysis to help address concerns
about the nature of an on-demand model. LLVM attempted something
similar, but suffered from significant performance issues they could
not solve with their approach. Our approach is not the same, and so
far we have seen no sign of pathological cases that are problematic.
I have trawled bugzilla looking for large problematic test cases, and
have tried a couple of test cases from LLVM which dberlin pointed out
as problematic for their implementation. To date, I have found nothing
that shows any kind of performance problem. I am more than happy to
entertain whatever cases y’all might want to throw my way.
For a test of robustness, we have built a complete Fedora distribution
consisting of 9174 packages with the ranger branch. All but 3 packages
build successfully and pass the relevant regression tests; 2 of those
failures appear to be related to RVRP and are still under analysis.
Our primary performance testbase to date has been the compiler itself.
We compile 242 .ii files from a stage1 compiler build and compare
times against EVRP. Full VRP is quite a bit slower than EVRP, and
although we do have an iterative update infrastructure, a comparison
against VRP wouldn't be quite fair since we don't yet perform all the
equivalence and bitmask operations it does.
Before going into the numbers, I would like to visit a minor issue we
have with switches. RVRP works from the bottom up, so in order to
evaluate a range, it begins by getting the constant range for the LHS
of the branch from the edge. For a conditional this is trivially [0,0]
or [1,1], depending on whether it is the TRUE or FALSE edge.

For a switch, it turns out GCC's GIMPLE representation has no simple
way to figure this out. As a result, we need to loop over every edge
in the switch and union together all the cases which share that edge,
or, for the default edge, intersect out all the other cases. This
turns out to be *very* time consuming in test cases with very large
switches, somewhere in the vicinity of O(n^3). Ugh. So the ranger
incurs a fair amount of overhead evaluating, and re-evaluating, these
constant edge ranges.
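To make the cost concrete, here is a minimal self-contained sketch of
the per-edge computation described above. The types and helpers
(range, case_entry, range_on_switch_edge) are illustrative stand-ins,
not GCC's actual GIMPLE or irange API:

  #include <algorithm>
  #include <cstdint>
  #include <utility>
  #include <vector>

  // Illustrative model: a "range" is a list of disjoint [lo,hi] pairs
  // standing in for GCC's irange; a case maps [lo,hi] to one of the
  // switch's outgoing edges.
  typedef std::vector<std::pair<int64_t, int64_t>> range;
  struct case_entry { int64_t lo, hi; int edge; };

  // Union together the case values which share EDGE.  For the default
  // edge we would instead start from varying and intersect out every
  // explicit case.
  static range
  range_on_switch_edge (const std::vector<case_entry> &cases, int edge)
  {
    range r;
    for (const case_entry &c : cases)     // O(cases) work per edge.
      if (c.edge == edge)
        r.push_back (std::make_pair (c.lo, c.hi));
    // A real implementation also merges adjacent pairs after sorting.
    std::sort (r.begin (), r.end ());
    return r;
  }

Since this question is asked once per outgoing edge, a switch with n
cases costs O(n) per edge and O(n^2) per full evaluation;
re-evaluating the edges as ranges iterate is what pushes large
switches toward the O(n^3) behavior mentioned above.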
There are ways to address this… but for now we will present
performance numbers for each of the 5 switch configurations listed
here:
1 - Calculating ranges outright from the stock branch.
2 - Timers adjusted to exclude switch edge calculation code (i.e.
pretend the range is available on the edge like it is with the TRUE
and FALSE edges).
3 - Do not process switches. We spend extra time on switches because
we always attempt to calculate ranges very precisely, as if we had
infinite precision. There is room to trim the outcomes here, but we
have made no attempt to do so yet.
4 - Just like EVRP, RVRP currently includes building the dominator
tree and integrating calls into SCEV at the beginning of each block to
see if there are any loop range refinements. The ranger has no need
for either of these to operate, and many passes will not care. So we
produce a 4th number for RVRP where we don't build any of the
infrastructure it doesn't need.
5 - RVRP can also run in conditional-only mode. Rather than walking
the entire IL trying to resolve every range, it simply looks at the
last statement of every block and asks whether the branch can be
folded (a sketch of this mode follows the list). This catches a lot of
what VRP catches that affects the CFG, and could be utilized either at
lower optimization levels, or as VRP itself if we can push all the
other activities it performs into the appropriate optimizations (such
as making CCP range aware). NOTE: this mode *still* calculates switch
ranges, so it includes that slowdown.
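For reference, the conditional-only walk looks roughly like the sketch
below. This is a hedged model using stand-in types; the stmt and
basic_block structures and the fold query are simplified versions of
their GIMPLE counterparts, not the branch's actual code:

  #include <optional>
  #include <vector>

  // Simplified stand-ins for GIMPLE structures.
  struct stmt { bool is_branch; };
  struct basic_block { std::vector<stmt> stmts; };

  // Stub for the ranger query: does the range of the branch condition
  // fold it to a constant?  Returns the known direction, or nothing.
  static std::optional<bool>
  fold_branch_with_ranger (const stmt &)
  {
    return std::nullopt;  // The real query evaluates the condition.
  }

  // Conditional-only mode: instead of resolving a range for every name
  // in the IL, visit only the final statement of each block.
  static void
  conditional_only_vrp (std::vector<basic_block> &cfg)
  {
    for (basic_block &bb : cfg)
      {
        if (bb.stmts.empty () || !bb.stmts.back ().is_branch)
          continue;
        if (std::optional<bool> taken
              = fold_branch_with_ranger (bb.stmts.back ()))
          {
            // Fold the branch to the known direction; dead edge and
            // block cleanup happens afterwards, as usual.
            (void) *taken;
          }
      }
  }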
All times are with a release-configured compiler. Out of the 242
files, pretty much across the board in all the sets of figures, RVRP
was faster in about 90% of the cases and slower in the other 10%,
resulting in the following cumulative totals.
Overall (242 files)
1 - Raw RVRP 22% slower
2 - No edge calculation(1) 4.5% slower
3 - No switch processing(2) 9.5% faster
4 - No dominators(3) 16% faster
5 - Conditionals (including switches) only 4% faster
These numbers indicate that large switches are the primary cause of
the cases where RVRP is slower, and we have various approaches we
could use to address this. Removing the time spent building
unnecessary dominators shows a significant improvement as well.
We also have the results for the time spent in the passes we converted:
Overall (242 files)
1 - -wrestrict 19% faster
2 - -wprintf 95% faster
3 - -walloca 1% slower
-wrestrict has the dominator walk removed, since it is no longer
needed, and simply calls into a ranger to get the range.
-wprintf has had the dominator build removed, as well as the EVRP
range walk. It really benefits from very few ranges being requested…
so the on-demand approach is a big win here, since we only calculate
what we actually need to answer a small number of questions.
-walloca is a touch slower because we are doing more work. The
original pass simply queried the global range information; we replaced
the calls to SSA_NAME_RANGE_INFO() with calls to the ranger to get
accurate ranges, so this overhead is the time required to calculate
those ranges. The Walloca pass now handles more things (for instance,
we fix gcc.dg/Walloca-6.c), and we catch more issues on a typical
bootstrap. For example, we find a possible out-of-bounds VLA in
libgomp/task.c.
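The shape of that change is roughly the following. This is a hedged
sketch rather than a quote from the branch: SSA_NAME_RANGE_INFO is the
macro the old code consulted, but the variable names and the ranger
entry point shown (range_of_expr, the modern spelling) are
illustrative and may differ from what the branch actually uses:

  // Before: only the global range stored on the SSA name was used.
  if (SSA_NAME_RANGE_INFO (len))
    {
      /* ... consult the global range of LEN ... */
    }

  // After: ask the ranger for the range of LEN at this particular
  // alloca call, which folds in the dominating conditions.
  int_range_max r;
  ranger.range_of_expr (r, len, alloca_call);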
We are also working on integrating the on-demand ranges with the
backwards threader. It queries ranges over each potential path to see
if a collapsible branch can be found, and then uses that information
to find better threading opportunities than are currently possible.
Results thus far have been very promising.
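In outline, that path query looks something like the sketch below. It
is an illustrative model under simplifying assumptions: a range is a
single [lo, hi] interval rather than GCC's multi-range irange, and a
path is reduced to the list of ranges its edges imply for the name
controlling the final branch:

  #include <algorithm>
  #include <optional>
  #include <vector>

  // One interval standing in for GCC's multi-range irange.
  struct range { long lo, hi; };

  // Intersect two ranges; an empty result means the path cannot be
  // taken at all.
  static std::optional<range>
  intersect (range a, range b)
  {
    range r = { std::max (a.lo, b.lo), std::min (a.hi, b.hi) };
    if (r.lo > r.hi)
      return std::nullopt;
    return r;
  }

  // Walk a candidate path, narrowing the range of the controlling name
  // edge by edge, and report whether the final branch folds.
  static bool
  branch_folds_on_path (const std::vector<range> &edges_on_path,
                        range varying)
  {
    std::optional<range> r = varying;
    for (const range &on_edge : edges_on_path)
      {
        r = intersect (*r, on_edge);
        if (!r)
          return true;  // Unreachable path: trivially threadable.
      }
    // Simplification: treat the branch as foldable when the range
    // collapses to a single value.
    return r->lo == r->hi;
  }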
These numbers all indicate that this approach is viable compared with
the existing one, and is quite frequently faster. It already has
iterative back-edge resolution integrated, and it handles cases that
the existing VRP and EVRP approaches have difficulty with.
Comments and feedback always welcome!
Thanks
Andrew