https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Most of the -O1 dom time is spent in threading using path ranger to simplify
the JT conditions.  That in turn does (for each threading from scratch?)
GORI computes with most of the time spent in range_from_dom in the
gori_compute::may_recompute_p cycle (and there doing bitmap operations).
That compute has a 'depth' --param but it looks like range_from_dom
doesn't and we have a very deep dominator tree for this testcase.

What's also oddly expensive (visible through the wide_int_storage::operator=
profile) is irange_bitmask::intersect, I suspect

  // If we have two known bits that are incompatible, the resulting
  // bit is undefined.  It is unclear whether we should set the entire
  // range to UNDEFINED, or just a subset of it.  For now, set the
  // entire bitmask to unknown (VARYING).
  if (wi::bit_and (~(m_mask | src.m_mask),
                   m_value ^ src.m_value) != 0)
    {

is quite expensive to evaluate.

It might make sense to implement a wi::not_ior_and_xor_nonzero_p special
case for this (unfortunately wide-int doesn't use expression templates).

Limiting range_from_dom like the following improves compile-time at -O1
from 600s to 200s (tested on the gcc-14 branch).  This should probably
re-use an existing ranger --param that's related or add a new one.
Note that -O -fno-thread-jumps compiles in 30s.  IIRC path-ranger uses
it's own cache it wipes between queries - I don't know how this interacts
with GORI (it hopefully shouldn't recompute things, but I don't know).

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index a33b7a73872..47117db0648 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -1668,6 +1668,7 @@ ranger_cache::range_from_dom (vrange &r, tree name,
basic_block start_bb,
   else
     bb = get_immediate_dominator (CDI_DOMINATORS, start_bb);

+  unsigned depth = 10;
   // Search until a value is found, pushing blocks which may need calculating.
   for ( ; bb; prev_bb = bb, bb = get_immediate_dominator (CDI_DOMINATORS, bb))
     {
@@ -1709,6 +1710,9 @@ ranger_cache::range_from_dom (vrange &r, tree name,
basic_block start_bb,

       if (m_on_entry.get_bb_range (r, name, bb))
        break;
+
+      if (--depth == 0)
+       break;
     }

   if (DEBUG_RANGE_CACHE)

I think we're running into several "layers" of (limited or unlimited)
recursions that compose to O (N^M) behavior here.  In other places of
the compiler we impose a global work limit to avoid this and to allow
one layer to use up the work fully when the others do not need deep
recursion.  Of course that only works if the work can be fairly
distributed.

Note instead of limiting the depth of the DOM walk above you could
also limit the number of blocks added to m_workback.

We hit the join block handling often for the testcase, I think that
when one of the pred_bb->preds has BB as src we can avoid adding
to m_workback since we know there's no edge range on that edge
and thus resolve_dom would union with VARYING?  Thus

@@ -1693,8 +1694,8 @@ ranger_cache::range_from_dom (vrange &r, tree name,
basic_block start_bb,
              edge_iterator ei;
              bool all_dom = true;
              FOR_EACH_EDGE (e, ei, prev_bb->preds)
-               if (e->src != bb
-                   && !dominated_by_p (CDI_DOMINATORS, e->src, bb))
+               if (e->src == bb
+                   || !dominated_by_p (CDI_DOMINATORS, e->src, bb))
                  {
                    all_dom = false;
                    break;

though doing this doesn't help the testcase.  But I see that
resolve_dom eventually recurses to range_from_dom which in this case
doesn't stop at the immediate dominator of prev_bb but again only
eventually at the definition of 'name'.  For the testcase we always
have only two incoming edges but in theory this leads to quadraticness?

Trying to limit this with a hack (not sure when else the stack isnt
empty upon recursion) like the following doensn't help though (in addition
to the above changes), instead it results in a slight slowdown.

@@ -1632,6 +1632,8 @@ ranger_cache::resolve_dom (vrange &r, tree name,
basic_block bb)
   m_on_entry.set_bb_range (name, bb, r);
 }

+static vec<basic_block> rfd_limit = vNULL;
+
 // Get the range of NAME from dominators of BB and return it in R.  Search the
 // dominator tree based on MODE.

@@ -1657,6 +1659,8 @@ ranger_cache::range_from_dom (vrange &r, tree name,
basic_block start_bb,
   // Range on entry to the DEF block should not be queried.
   gcc_checking_assert (start_bb != def_bb);
   unsigned start_limit = m_workback.length ();
+  if (!rfd_limit.is_empty ())
+    def_bb = get_immediate_dominator (CDI_DOMINATORS, rfd_limit.last ());

   // Default value is global range.
   get_global_range (r, name);
@@ -1736,7 +1744,11 @@ ranger_cache::range_from_dom (vrange &r, tree name,
basic_block start_bb,
          // RFD_FILL, then the cache cant be stored to, so don't try.
          // Otherwise this becomes a quadratic timed calculation.
          if (mode == RFD_FILL)
-           resolve_dom (r, name, prev_bb);
+           {
+             rfd_limit.safe_push (prev_bb);
+             resolve_dom (r, name, prev_bb);
+             rfd_limit.pop ();
+           }
          continue;
        }

Reply via email to