We have done extensive performance analysis to help address concerns
about the nature of an on-demand model. LLVM attempted something
similar, but suffered from significant performance issues they could
not solve with their approach. Our approach is not the same, and so
far we have seen no sign of pathological cases that are problematic.
I have trawled bugzilla looking for large problematic test cases, and
have tried a couple of test cases from LLVM which dberlin pointed out
as problematic for their implementation. To date, I have found nothing
that shows any kind of performance problem. I am more than happy to
entertain whatever cases y’all might want to throw my way.
For a test of robustness, we have built a complete Fedora distribution
consisting of 9174 packages with the ranger branch. All but 3 packages
build successfully and pass the relevant regression tests; 2 of those
failures appear to be related to RVRP and are still under analysis.
Our primary performance testbase to date has been the compiler itself.
We compile 242 .ii files from a stage1 compiler build and compare
times against EVRP. Full VRP is quite a bit slower than EVRP, and
although we do have an iterative update infrastructure, a comparison
against VRP wouldn't be quite fair since we don't yet perform all the
equivalence and bitmask operations it does.
Before going into the numbers, I would like to visit a minor issue we
have with switches. RVRP works from the bottom up, so in order to
evaluate a range, it begins by getting the constant range for the LHS
of the branch from the edge. For a conditional this is trivially [0,0]
or [1,1], depending on whether it is the TRUE or FALSE edge.

For a switch, it turns out GCC's GIMPLE representation has no simple
way to figure this out. As a result, we need to loop over every edge
in the switch and union together all the cases which share that edge,
or, for the default edge, intersect out all the other cases. This
turns out to be *very* time consuming in test cases with very large
switches, somewhere in the vicinity of O(n^3). Ugh. So the ranger
incurs a fair amount of overhead evaluating, and re-evaluating, these
constant edge ranges.
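To make the cost concrete, here is a minimal self-contained sketch of
the per-edge computation described above. The types and helpers
(range, case_entry, range_on_switch_edge) are illustrative stand-ins,
not GCC's actual GIMPLE or irange API:

  #include <algorithm>
  #include <cstdint>
  #include <utility>
  #include <vector>

  // Illustrative model: a "range" is a list of disjoint [lo,hi] pairs
  // standing in for GCC's irange; a case maps [lo,hi] to one of the
  // switch's outgoing edges.
  typedef std::vector<std::pair<int64_t, int64_t>> range;
  struct case_entry { int64_t lo, hi; int edge; };

  // Union together the case values which share EDGE.  For the default
  // edge we would instead start from varying and intersect out every
  // explicit case.
  static range
  range_on_switch_edge (const std::vector<case_entry> &cases, int edge)
  {
    range r;
    for (const case_entry &c : cases)     // O(cases) work per edge.
      if (c.edge == edge)
        r.push_back (std::make_pair (c.lo, c.hi));
    // A real implementation also merges adjacent pairs after sorting.
    std::sort (r.begin (), r.end ());
    return r;
  }

Since this question is asked once per outgoing edge, a switch with n
cases costs O(n) per edge and O(n^2) per full evaluation;
re-evaluating the edges as ranges iterate is what pushes large
switches toward the O(n^3) behavior mentioned above.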
There are ways to address this… but for now we will present
performance numbers for each of the 5 switch configurations listed
here:
1 - Calculating ranges outright from the stock branch.
2 - Timers adjusted to exclude switch edge calculation code (i.e.
pretend the range is available on the edge like it is with the TRUE
and FALSE edges).
3 - Do not process switches. We spend extra time on switches because
we always attempt to calculate ranges very precisely, as if we had
infinite precision. There is room to trim the outcomes here, but we
have made no attempt to do so yet.
4 - Just like EVRP, RVRP currently includes building the dominator
tree and integrating calls into SCEV at the beginning of each block to
see if there are any loop range refinements. The ranger has no need
for either of these to operate, and many passes will not care. So we
produce a 4th number for RVRP where we don't build any of the
infrastructure it doesn't need.
5 - RVRP can also run in conditional-only mode. Rather than walking
the entire IL trying to resolve every range, it simply looks at the
last statement of every block and asks whether the branch can be
folded (a sketch of this mode follows the list). This catches a lot of
what VRP catches that affects the CFG, and could be utilized either at
lower optimization levels, or as VRP itself if we can push all the
other activities it performs into the appropriate optimizations (such
as making CCP range aware). NOTE: this mode *still* calculates switch
ranges, so it includes that slowdown.
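For reference, the conditional-only walk looks roughly like the sketch
below. This is a hedged model using stand-in types; the stmt and
basic_block structures and the fold query are simplified versions of
their GIMPLE counterparts, not the branch's actual code:

  #include <optional>
  #include <vector>

  // Simplified stand-ins for GIMPLE structures.
  struct stmt { bool is_branch; };
  struct basic_block { std::vector<stmt> stmts; };

  // Stub for the ranger query: does the range of the branch condition
  // fold it to a constant?  Returns the known direction, or nothing.
  static std::optional<bool>
  fold_branch_with_ranger (const stmt &)
  {
    return std::nullopt;  // The real query evaluates the condition.
  }

  // Conditional-only mode: instead of resolving a range for every name
  // in the IL, visit only the final statement of each block.
  static void
  conditional_only_vrp (std::vector<basic_block> &cfg)
  {
    for (basic_block &bb : cfg)
      {
        if (bb.stmts.empty () || !bb.stmts.back ().is_branch)
          continue;
        if (std::optional<bool> taken
              = fold_branch_with_ranger (bb.stmts.back ()))
          {
            // Fold the branch to the known direction; dead edge and
            // block cleanup happens afterwards, as usual.
            (void) *taken;
          }
      }
  }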
All times are with a release-configured compiler. Out of the 242
files, pretty much across the board in all the sets of figures, RVRP
was faster in about 90% of the cases and slower in the other 10%,
resulting in the following cumulative totals.
Overall (242 files)
1 - Raw RVRP 22% slower
2 - No edge calculation(1) 4.5% slower
3 - No switch processing(2) 9.5% faster
4 - No dominators(3) 16% faster
5 - Conditionals (including switches) only 4% faster
These numbers indicate that large switches are the primary cause of
the cases where RVRP is slower, and we have various approaches we
could use to address this. Removing the time spent building
unnecessary dominators shows a significant improvement as well.
We also have the results for the time spent in the passes we converted:
Overall (242 files)
1 - -wrestrict 19% faster
2 - -wprintf 95% faster
3 - -walloca 1% slower
-wrestrict has the dominator walk removed, since it is no longer
needed, and simply calls into a ranger to get the range.
-wprintf has had the dominator build removed, as well as the EVRP
range walk. It really benefits from very few ranges being requested…
so the on-demand approach is a big win here, since we only calculate
what we actually need to answer a small number of questions.
-walloca is a touch slower because we are doing more work. The
original pass simply queried the global range information; we replaced
the calls to SSA_NAME_RANGE_INFO() with calls to the ranger to get
accurate ranges, so this overhead is the time required to calculate
those ranges. The Walloca pass now handles more things (for instance,
we fix gcc.dg/Walloca-6.c), and we catch more issues on a typical
bootstrap. For example, we find a possible out-of-bounds VLA in
libgomp/task.c.
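The shape of that change is roughly the following. This is a hedged
sketch rather than a quote from the branch: SSA_NAME_RANGE_INFO is the
macro the old code consulted, but the variable names and the ranger
entry point shown (range_of_expr, the modern spelling) are
illustrative and may differ from what the branch actually uses:

  // Before: only the global range stored on the SSA name was used.
  if (SSA_NAME_RANGE_INFO (len))
    {
      /* ... consult the global range of LEN ... */
    }

  // After: ask the ranger for the range of LEN at this particular
  // alloca call, which folds in the dominating conditions.
  int_range_max r;
  ranger.range_of_expr (r, len, alloca_call);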
We are also working on integrating the on-demand ranges with the
backwards threader. It queries ranges over each potential path to see
if a collapsible branch can be found, and then uses that information
to find better threading opportunities than are currently possible.
Results thus far have been very promising.
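In outline, that path query looks something like the sketch below. It
is an illustrative model under simplifying assumptions: a range is a
single [lo, hi] interval rather than GCC's multi-range irange, and a
path is reduced to the list of ranges its edges imply for the name
controlling the final branch:

  #include <algorithm>
  #include <optional>
  #include <vector>

  // One interval standing in for GCC's multi-range irange.
  struct range { long lo, hi; };

  // Intersect two ranges; an empty result means the path cannot be
  // taken at all.
  static std::optional<range>
  intersect (range a, range b)
  {
    range r = { std::max (a.lo, b.lo), std::min (a.hi, b.hi) };
    if (r.lo > r.hi)
      return std::nullopt;
    return r;
  }

  // Walk a candidate path, narrowing the range of the controlling name
  // edge by edge, and report whether the final branch folds.
  static bool
  branch_folds_on_path (const std::vector<range> &edges_on_path,
                        range varying)
  {
    std::optional<range> r = varying;
    for (const range &on_edge : edges_on_path)
      {
        r = intersect (*r, on_edge);
        if (!r)
          return true;  // Unreachable path: trivially threadable.
      }
    // Simplification: treat the branch as foldable when the range
    // collapses to a single value.
    return r->lo == r->hi;
  }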
These numbers all indicate that this approach is viable compared with
the existing one, and is quite frequently faster. It already has
iterative back-edge resolution integrated, and it handles cases that
the existing VRP and EVRP approaches have difficulty with.
Comments and feedback always welcome!
Thanks
Andrew