On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacl...@redhat.com> wrote:
>
> We have done extensive performance analysis to help address concerns
> about the nature of an on-demand model. LLVM made an attempt at
> something similar, but suffered from significant performance issues
> they could not solve with their approach. Our approach is not the
> same, and we have seen no sign of pathological cases.
>
> I have trawled Bugzilla looking for large problematic test cases, and
> have tried a couple of test cases from LLVM which dberlin pointed out
> they found to be problematic.  To date, I have found nothing that shows
> any kind of performance problem.  I am more than happy to entertain
> whatever cases y’all might want to throw my way.
>
> For a test of robustness, we have built a complete Fedora distribution
> consisting of 9174 packages with the ranger branch. All but 3 packages
> build successfully and pass the relevant regression tests; 2 of the 3
> failures appear to be related to RVRP and are still under analysis.
>
> Our primary performance test base to date has been the compiler itself.
> We compile 242 .ii files from a stage1 compiler build and compare times
> to EVRP. Full VRP is quite a bit slower than EVRP, and although we do
> have an iterative update infrastructure, comparisons against VRP aren't
> quite fair since we don't yet do all the equivalence and bitmask
> operations it does.
>
> Before going into the numbers, I would like to visit a minor issue we
> have with switches. RVRP works from the bottom up, so in order to
> evaluate a range, it begins by getting the constant range for the LHS of
> the branch from the edge. For a conditional this is trivially [1,1] or
> [0,0], depending on whether we are on the TRUE or the FALSE edge.
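>
> A tiny illustration of the conditional case (toy code, not the
> branch's actual API):
>
>   // The range of the condition's result on an outgoing edge is
>   // fully determined by which edge it is; no computation is needed.
>   struct range { long lo, hi; };
>   enum edge_kind { TRUE_EDGE, FALSE_EDGE };
>
>   static range
>   cond_edge_range (edge_kind e)
>   {
>     return e == TRUE_EDGE ? range{1, 1} : range{0, 0};
>   }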
>
> For a switch, it turns out GCC's GIMPLE representation has no simple way
> to figure this out. As a result, we need to loop over every edge in the
> switch and then union together all the cases which share that edge, or,
> in the default case, intersect out all the other cases. This turns out
> to be *very* time consuming in test cases with very large switches,
> somewhere in the vicinity of O(n^3). Ugh.  So the ranger incurs a fair
> amount of overhead evaluating, and re-evaluating, these constant
> edge ranges.
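>
> To make the shape of that computation concrete, here is a toy model
> (self-contained C++, not the branch's implementation; GCC's multi-range
> representation and interfaces differ):
>
>   #include <algorithm>
>   #include <vector>
>
>   struct subrange { long lo, hi; };              // one contiguous span
>   struct case_entry { subrange val; int dest; }; // case -> edge
>
>   // Union: sort and coalesce overlapping or adjacent subranges.
>   static std::vector<subrange>
>   coalesce (std::vector<subrange> v)
>   {
>     std::sort (v.begin (), v.end (),
>                [] (const subrange &a, const subrange &b)
>                { return a.lo < b.lo; });
>     std::vector<subrange> out;
>     for (const subrange &s : v)
>       if (!out.empty () && s.lo <= out.back ().hi + 1)
>         out.back ().hi = std::max (out.back ().hi, s.hi);
>       else
>         out.push_back (s);
>     return out;
>   }
>
>   // Constant range for one outgoing edge of a switch.
>   static std::vector<subrange>
>   switch_edge_range (const std::vector<case_entry> &cases, int edge,
>                      bool is_default, subrange type_range)
>   {
>     if (!is_default)
>       {
>         // Union together all the cases which share this edge.
>         std::vector<subrange> v;
>         for (const case_entry &c : cases)
>           if (c.dest == edge)
>             v.push_back (c.val);
>         return coalesce (v);
>       }
>     // Default edge: subtract every explicit case from the full range.
>     std::vector<subrange> v = { type_range };
>     for (const case_entry &c : cases)
>       {
>         std::vector<subrange> next;
>         for (const subrange &s : v)
>           {
>             if (c.val.lo > s.lo)
>               next.push_back ({ s.lo, std::min (s.hi, c.val.lo - 1) });
>             if (c.val.hi < s.hi)
>               next.push_back ({ std::max (s.lo, c.val.hi + 1), s.hi });
>           }
>         v = next;
>       }
>     return v;
>   }
>
> Doing this for every edge, where each union or subtraction itself walks
> a subrange list that grows with the number of cases, is what pushes
> large switches toward O(n^3).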
>
> There are ways to address this… but for now we will present
> performance numbers for each of the 5 switch-handling configurations
> listed here:
>      1 - Calculating ranges outright from the stock branch.
>      2 - Timers adjusted to exclude the switch edge calculation code
> (i.e. pretend the range is available on the edge, as it is with the
> TRUE and FALSE edges).
>      3 - Do not process switches.  We spend extra time on switches
> because we always attempt to calculate ranges very precisely as if we
> had infinite precision. There is room for trimming outcomes here, but we
> have made no attempt yet.
>      4 - Just like EVRP,  RVRP currently includes building the
> dominators and integrating calls into SCEV at the beginning of each
> block to see if there are any loop range refinements.  The ranger has no
> need for this to operate, and many passes will not care.  So we produce
> a 4th number for RVRP where we don’t build any of the infrastructure it
> doesn’t need.
>      5 - RVRP can also run in conditional-only mode. Rather than
> walking the entire IL trying to resolve every range, it simply looks at
> the last statement of every block and asks whether the branch can be
> folded (see the sketch after this list).  This gets most of what VRP
> gets that affects the CFG, and could be utilized either at lower
> optimization levels, or as VRP itself if we can push all the other
> activities it does into appropriate optimizations (such as making CCP
> range aware).      **NOTE**: This mode *still* calculates switch
> ranges, so it includes that slowdown.
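>
> A sketch of that conditional-only walk (again toy code; the ranger
> query here is a stand-in for the real interface):
>
>   #include <vector>
>
>   struct range
>   {
>     long lo, hi;
>     bool singleton_p () const { return lo == hi; }
>   };
>   struct block { int id; bool ends_in_branch; };
>
>   // Stand-in for the ranger: resolve the branch result of a block.
>   static bool
>   range_of_branch (const block &, range &r)
>   {
>     r = { 1, 1 };   // stub: pretend the condition is known true
>     return true;
>   }
>
>   // Visit only the last statement of each block, asking whether the
>   // branch folds; no full IL walk, no per-statement range resolution.
>   static void
>   conditional_only_vrp (std::vector<block> &cfg)
>   {
>     for (block &b : cfg)
>       {
>         range r;
>         if (b.ends_in_branch && range_of_branch (b, r)
>             && r.singleton_p ())
>           { /* fold the branch to the now-known edge */ }
>       }
>   }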
>
> All times are with a release-configured compiler.  Out of the 242
> files, pretty much across the board in all 5 sets of figures, RVRP was
> faster in about 90% of the cases and slower in the other 10%, resulting
> in the following cumulative totals:
>
>                                               Overall (242 files)
> 1 - Raw RVRP                                  22% slower   (**a**)
> 2 - No edge calculation (1)                   4.5% slower
> 3 - No switch processing (2)                  9.5% faster
> 4 - No dominators (3)                         16% faster   (**b**)
> 5 - Conditionals (including switches) only     4% faster

I don't understand a vs. b here.  Does this mean dominators
cost 38% performance?!

Does 4 imply 3 and 2 maybe?

Dominators cannot be this expensive.

> These numbers indicate that large switches are the primary cause in
> the cases where RVRP is slower.  We have various approaches we could
> use to address this.  Removing the time spent building unnecessary
> dominators also shows a significant improvement.
>
> We also have the results for the time spent in the passes we converted:
>
>                   Overall (242 files)
> 1 - -wrestrict      19% faster
> 2 - -wprintf        95% faster
> 3 - -walloca         1% slower
>
> -wrestrict has the dominator walk removed since it is no longer needed,
> and simply calls into a ranger to get the range.
>
> -wprintf has had the dominator build removed, as well as the EVRP
> range walk. It really benefits from very few ranges being requested… so
> the on-demand approach is a big win here, since we only calculate what
> we actually need to answer a small number of questions.
>
> -walloca is a touch slower because we are doing more work.  The
> original pass simply queried the global range information; we replaced
> calls to SSA_NAME_RANGE_INFO() with calls to the ranger to get accurate,
> context-sensitive ranges, so this overhead is the cost of calculating
> those ranges.  The Walloca pass now handles more things (for instance,
> we fix gcc.dg/Walloca-6.c), and we catch more issues on a typical
> bootstrap.  For example, we find a possible out-of-bounds VLA in
> libgomp/task.c.
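>
> The shape of that change, as a toy model (the query names below are
> illustrative, not GCC's actual interfaces):
>
>   #include <cstdio>
>
>   struct range { long lo, hi; };
>
>   // Before: one global range per SSA name, precomputed by a prior
>   // VRP run.  Cheap to read, but context-insensitive (this stands
>   // in for what SSA_NAME_RANGE_INFO hands back).
>   static range
>   global_range (int /*name*/)
>   {
>     return { 0, 1000 };   // stub data
>   }
>
>   // After: ask a ranger-like oracle for the range of the name at a
>   // specific statement.  Computed on demand, so it can exploit the
>   // branches dominating that statement and be much tighter.
>   static range
>   on_demand_range (int /*name*/, int stmt)
>   {
>     // stub: pretend a guarding branch narrowed the range here
>     return stmt > 0 ? range{ 0, 99 } : range{ 0, 1000 };
>   }
>
>   int
>   main ()
>   {
>     range g = global_range (1);
>     range r = on_demand_range (1, 42);
>     printf ("global [%ld,%ld] vs on-demand [%ld,%ld]\n",
>             g.lo, g.hi, r.lo, r.hi);
>   }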
>
> We are also working on integrating the on-demand ranges with the
> backwards threader.  It queries ranges over each potential path to see
> if a collapsible branch can be found, and then uses that information to
> find better threading opportunities than are currently possible.
> Results thus far have been very promising.
>
> These numbers all indicate that this approach is viable versus the
> existing one, and is quite frequently faster.  It already has iterative
> back-edge resolution integrated, and it handles cases that the existing
> VRP and EVRP approaches have difficulty with.
>
> Comments and feedback always welcome!
> Thanks
> Andrew
>
