Hi,

while developing GCC 15, we found out that the bit-test switch lowering
algorithm is quadratic and that it is possible to construct a switch that
causes huge compile time (PR117091).

Andi Kleen came up with a fast bit-test lowering algorithm which was
1) linear
2) in some cases better than the original algorithm
3) in some cases worse than the original algorithm

For GCC 15 we decided to use the original algorithm when the given switch is
small and the fast algorithm when the switch is too big.

In this patch set I combine both algorithms.  The new algorithm gives strictly
better solutions than Andi's and the original one and runs in linear time.

More details below and in commit messages.  Bootstrapped and regtested on
x86_64 linux.

Ok for trunk?
Filip Kastl


Better solutions
----------------

When I say "better solutions", I mean the number of clusters.  Cluster is a
concept from switch lowering pass.  It is basically a group of cases that can
be handled by constantly many operations.

On every switch the new algorithm should output clustering of the cases such
that the number of clusters is the smallest possible.  This should be
guaranteed by how the algorithm works.

To be extra sure, I tested this by compiling GCC itself.  I bootstrapped GCC
once for each of the three algorithms (the original one, Andi's and the new
one) and I counted the number of clusters for each function (I didn't find a
way to easily compare individual switches).  Indeed the new algorithm either
gave a function with fewer clusters or a function with more switches
(if-to-switch conversion converts if-else chains into switches based on if
switch lowering can lower the resulting switch -- I think that this is also a
win).


A note about --param switch-lower-slow-alg-max-cases
----------------------------------------------------

The param switch-lower-slow-alg-max-cases specified how big a switch had to be
so that GCC would switch to the fast algorithm.  After this patch set it will
serve no purpose.  I kept it though.  There is the jump-table switch lowering
algorithm.  That algorithm is still quadratic.  Nobody ran into any real issues
with that yet, but it would be nice to come up with a fast algorithm for
jump-table lowering and set GCC to fall back to it when a switch is too big
(pr118353).  Then the param would be relevant again.  I plan to look into that
in this Stage 1.


Testing
-------

- I compiled the testcase from PR117091 to confirm that the new algorithm
  spends reasonable time on it.

> gcc.sh -O2 -ftime-report big-switch.c 
Time variable                                  wall           GGC
 ipa SRA                            :  12.58 (  5%)   536  (  0%)
 dominator optimization             :  55.13 ( 22%)  4687k (  1%)
 tree FRE                           :  71.05 ( 29%)   744  (  0%)
 tree switch lowering               :  27.41 ( 11%)    13M (  3%)
 tree modref                        :  13.24 (  5%)   480  (  0%)
 rest of compilation                :  23.09 (  9%)  2816  (  0%)
 TOTAL                              : 248.27          413M
(some entries omitted)

The original bug report claimed that this testcase ran over 30 minutes.  That's
not the case here.  Switch lowering doesn't even dominate the compilation time.

- I also ran a few SPEC CPU 2017 benchmarks (x86_64 linux).

with this patch set:

O2 generic
500.perlbench_r                               NR       1        237       6.72  
*
502.gcc_r                                     NR       1        216       6.57  
*

Ofast native
500.perlbench_r                               NR       1        251       6.33  
*
502.gcc_r                                     NR       1        210       6.73  
*

without this patch set:

O2 generic
500.perlbench_r                               NR       1        242       6.57  
*
502.gcc_r                                     NR       1        216       6.56  
*

Ofast native
500.perlbench_r                               NR       1        253       6.29  
*
502.gcc_r                                     NR       1        210       6.73  
*

The fourth row is the exec time in seconds.  There are only small differences
(2% max) and they favor the new algorithm.  They could just be noise, btw.

I've selected the perl and gcc benchmarks because I expect that those contain
many switches.

- I bootstrapped and regtested this on x86_64 linux.


Filip Kastl (4):
  gimple: Merge slow and fast bit-test switch lowering [PR117091]
  gimple: Make bit-test switch lowering more powerful
  gimple: Don't warn about using different algs for big switch lowering
    [PR117091]
  gimple: Switch bit-test lowering testcases for the more powerful alg

 gcc/testsuite/gcc.dg/tree-ssa/switch-5.c |  60 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/switch-6.c |  51 +++++
 gcc/tree-switch-conversion.cc            | 245 +++++++----------------
 gcc/tree-switch-conversion.h             |  10 -
 4 files changed, 184 insertions(+), 182 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/switch-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/switch-6.c

-- 
2.48.1

Reply via email to