On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod <amacl...@redhat.com> wrote:
>
> We have done extensive performance analysis to help address concerns
> about the nature of an on-demand model.  LLVM made an attempt at
> something similar, but suffered from significant performance issues
> they could not solve with their approach.  This approach is not the
> same, and we have seen no sign of pathological cases which are
> problematic.
>
> I have trawled bugzilla looking for large problematic test cases, and
> have tried a couple of test cases from LLVM which dberlin pointed out
> they found to be problematic.  To date, I have found nothing that
> shows any kind of performance problem.  I am more than happy to
> entertain whatever cases y'all might want to throw my way.
>
> For a test of robustness, we have built a complete Fedora distribution
> consisting of 9174 packages with the ranger branch.  All packages
> except 3 build successfully and pass the relevant regression tests.
> It appears 2 of them are related to RVRP and are still under analysis.
>
> Our primary performance test base to date has been the compiler
> itself.  We compile 242 .ii files from a stage1 compiler build, and
> compare times to EVRP.  VRP is quite a bit slower, and although we do
> have an iterative update infrastructure, comparisons aren't quite
> fair since we don't yet do all the equivalence and bitmask operations
> it does.
>
> Before going into the numbers, I would like to visit a minor issue we
> have with switches.  RVRP works from the bottom up, so in order to
> evaluate a range, it begins by getting the constant range for the LHS
> of the branch from the edge.  For a conditional this is trivially
> [0,0] or [1,1], depending on whether it is the TRUE or FALSE edge.
>
> For a switch, it turns out GCC's GIMPLE representation has no simple
> way to figure this out.  As a result, we need to loop over every edge
> in the switch and then union together all the cases which share that
> edge, or in the default case, intersect out all the other cases.
> This turns out to be *very* time consuming in test cases with very
> large switches, somewhere in the vicinity of O(n^3).  Ugh.  So the
> ranger incurs a fair amount of overhead trying to evaluate, and
> re-evaluate, these constant edges.
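For what it's worth, here is a minimal, self-contained sketch of the
per-edge computation being described, as I understand it (the types and
helpers below are hypothetical stand-ins, not GCC's gswitch/irange API).
The cost structure is the point: one walk over all cases per edge, each
step unioning into a growing subrange list:

    // Toy model of a switch: each case label has an index range and a
    // target edge.  Ranges are sorted, disjoint [lo, hi] subrange lists.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct subrange { int64_t lo, hi; };              // inclusive bounds
    struct case_label { int64_t lo, hi; int edge; };

    // Union one subrange into a sorted, disjoint list: linear in its size.
    static void
    union_into (std::vector<subrange> &r, subrange s)
    {
      r.push_back (s);
      std::sort (r.begin (), r.end (),
                 [] (const subrange &a, const subrange &b)
                 { return a.lo < b.lo; });
      std::vector<subrange> merged;
      for (const subrange &x : r)
        if (!merged.empty () && x.lo <= merged.back ().hi + 1)
          merged.back ().hi = std::max (merged.back ().hi, x.hi);
        else
          merged.push_back (x);
      r = merged;
    }

    // Range of the switch index on EDGE: union of every case that jumps
    // there.  Doing this for each edge in turn is edges * cases *
    // list-size work; the default edge is worse still, since it starts
    // from the full type range and intersects out every case.
    static std::vector<subrange>
    range_on_edge (const std::vector<case_label> &cases, int edge)
    {
      std::vector<subrange> r;
      for (const case_label &c : cases)
        if (c.edge == edge)
          union_into (r, { c.lo, c.hi });
      return r;
    }

    int
    main ()
    {
      // switch (x): cases 0..9 and 20 go to edge 1, 30..40 to edge 2.
      std::vector<case_label> cases
        = { { 0, 9, 1 }, { 20, 20, 1 }, { 30, 40, 2 } };
      for (const subrange &s : range_on_edge (cases, 1))
        std::printf ("[%lld, %lld] ", (long long) s.lo, (long long) s.hi);
      std::printf ("\n");   // prints: [0, 9] [20, 20]
    }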
> There are ways to address this... but for now we will present
> performance numbers for each of the 5 switch configurations listed
> here:
>
>    1 - Calculating ranges outright from the stock branch.
>    2 - Timers adjusted to exclude the switch edge calculation code
>        (i.e. pretend the range is available on the edge like it is
>        with TRUE and FALSE).
>    3 - Do not process switches.  We spend extra time on switches
>        because we always attempt to calculate ranges very precisely,
>        as if we had infinite precision.  There is room for trimming
>        outcomes here, but we have made no attempt yet.
>    4 - Just like EVRP, RVRP currently includes building the
>        dominators and integrating calls into SCEV at the beginning
>        of each block to see if there are any loop range refinements.
>        The ranger has no need for this to operate, and many passes
>        will not care, so we produce a 4th number for RVRP where we
>        don't build any of the infrastructure it doesn't need.
>    5 - RVRP can also run in conditional-only mode.  Rather than
>        walking the entire IL trying to resolve every range, it can
>        simply look at the last statement of every block, asking if
>        the branch can be folded.  This catches a lot of what VRP
>        catches that affects the CFG, and could be utilized either at
>        lower optimization levels, or as VRP itself if we can push
>        all the other activities it does into appropriate
>        optimizations (such as making CCP range aware).  NOTE: this
>        mode *still* calculates switch ranges, so it includes that
>        slowdown.
>
> All times are with a release-configured compiler.  Out of the 242
> files, pretty much across the board in all 4 sets of figures, RVRP
> was faster in about 90% of the cases and slower in the other 10%,
> resulting in the following cumulative totals.
>
> Overall (242 files)
>    1 - Raw RVRP                                   22% slower  (a)
>    2 - No edge calculation (1)                   4.5% slower
>    3 - No switch processing (2)                  9.5% faster
>    4 - No dominators (3)                          16% faster  (b)
>    5 - Conditionals (including switches) only      4% faster

I don't understand (a) vs. (b) here.  Does this mean dominators cost
38% performance?!  Does 4 imply 3 and 2, maybe?  Dominators cannot be
this expensive.
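Separately, to check that I'm reading configuration 5 correctly, here
is a minimal, self-contained sketch of "visit only the last statement
of each block and ask whether the branch folds" (the types and the
ranger stand-in are hypothetical, not the branch's actual API):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct value_range { long lo, hi; };            // inclusive bounds

    // Stand-in for the ranger: known ranges for SSA names; the real
    // thing computes these on demand.
    static std::map<std::string, value_range> ranger;

    // Toy last statement of a block: "if (name < cst)".
    struct cond_stmt { std::string name; long cst; };
    struct basic_block { cond_stmt last; };

    // 1 / 0 if the branch folds to true / false, -1 if unknown.
    static int
    fold_cond (const cond_stmt &c)
    {
      auto it = ranger.find (c.name);
      if (it == ranger.end ())
        return -1;
      if (it->second.hi < c.cst)
        return 1;                                   // always true
      if (it->second.lo >= c.cst)
        return 0;                                   // always false
      return -1;
    }

    int
    main ()
    {
      ranger["i_1"] = { 0, 9 };
      // Only the final conditional of each block is ever queried; the
      // rest of the IL is not walked at all.
      std::vector<basic_block> cfg
        = { { { "i_1", 10 } }, { { "i_1", 5 } } };
      for (const basic_block &bb : cfg)
        std::printf ("if (%s < %ld) folds to %d\n",
                     bb.last.name.c_str (), bb.last.cst,
                     fold_cond (bb.last));
      // prints 1 (always true) for the first block, -1 (unknown) for
      // the second.
    }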
> These numbers indicate that large switches are the primary cause
> when RVRP is slower.  We have various approaches we could use to
> address this.  Removing the time spent building unnecessary
> dominators shows a significant improvement as well.
>
> We also have the results for the time spent in the passes we
> converted:
>
> Overall (242 files)
>    1 - -wrestrict   19% faster
>    2 - -wprintf     95% faster
>    3 - -walloca      1% slower
>
> -wrestrict has the dominator walk removed, since it is no longer
> needed, and simply calls into a ranger to get the range.
>
> -wprintf has had the dominator build removed, as well as the EVRP
> range walk.  It really benefits from very few ranges being
> requested... so the on-demand approach is a big win here, since we
> only calculate what we actually need to answer a small number of
> questions.
>
> -walloca is a touch slower because we are doing more work.  The
> original pass simply queried for global range information.  We
> replaced calls to SSA_NAME_RANGE_INFO() with calls to the ranger to
> get accurate ranges, so this overhead is the amount required to
> calculate those ranges.  The -walloca pass now handles more things
> (for instance we fix gcc.dg/Walloca-6.c), and we catch more issues
> on a typical bootstrap.  For example, we find a possible
> out-of-bounds VLA in libgomp/task.c.
>
> We are also working on integrating the on-demand ranges with the
> backwards threader.  It queries ranges over each potential path to
> see if a collapsible branch can be found, then uses that information
> to find better threading opportunities than are currently possible.
> Results thus far have been very promising.
>
> These numbers all indicate that this approach is viable compared
> with the existing one, and is quite frequently faster.  It already
> has iterative back-edge resolution integrated, and we can handle
> cases that the existing VRP and EVRP approaches have difficulty
> with.
>
> Comments and feedback always welcome!
> Thanks
> Andrew
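P.S. To make sure I follow the -walloca conversion: a minimal sketch of
the difference as I read it, with hypothetical helpers standing in for
SSA_NAME_RANGE_INFO and the ranger query (the names below are
illustrative, not the pass's actual code):

    #include <cstdio>
    #include <map>
    #include <optional>

    struct value_range { unsigned long lo, hi; };   // inclusive bounds

    // Old style: whatever global range an earlier pass happened to
    // record for an SSA name (often nothing).
    static std::map<int, value_range> global_range_info;

    // New style: compute a context-sensitive range on demand.
    // Hard-wired here; in the real pass this is a ranger query, and
    // the work it does is the ~1% overhead reported above.
    static std::optional<value_range>
    range_on_demand (int ssa_name)
    {
      if (ssa_name == 7)                 // e.g. n_7 = x_2 & 0x3f
        return value_range { 0, 63 };
      return std::nullopt;
    }

    static bool
    alloca_may_exceed (int size_name, unsigned long limit)
    {
      std::optional<value_range> r = range_on_demand (size_name);
      if (!r)
        return true;                     // unknown size: warn conservatively
      return r->hi > limit;              // warn only if the bound can exceed
    }

    int
    main ()
    {
      // With no global info recorded, the old pass must assume the
      // worst; the on-demand query proves the size is at most 63, so
      // no warning is needed.
      std::printf ("global info recorded? %s\n",
                   global_range_info.count (7) ? "yes" : "no");
      std::printf ("may exceed 4096? %s\n",
                   alloca_may_exceed (7, 4096) ? "yes" : "no");
    }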