https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So one now needs to bump the limit to 60 to get enough samples for perf.  We
then see

Samples: 55K of event 'cycles:u', Event count (approx.): 49013411833
Overhead       Samples  Command  Shared Object     Symbol
  51.19%         28195  cc1      cc1               [.] path_range_query::compute_ranges_in_block
  11.67%          6427  cc1      cc1               [.] path_range_query::adjust_for_non_null_uses
   9.20%          5069  cc1      cc1               [.] path_range_query::range_defined_in_block
   3.39%          1869  cc1      cc1               [.] bitmap_set_bit
   1.95%          1072  cc1      cc1               [.] back_threader::find_paths_to_names
   1.93%          1066  cc1      cc1               [.] bitmap_bit_p

compute_ranges_in_block is also at the top with the limit at 30, but
adjust_for_non_null_uses newly pops up at 60.

The compute_ranges_in_block slowness is attributed to

  // ...and then the rest of the imports.
  EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));

      if (gimple_code (SSA_NAME_DEF_STMT (name)) != GIMPLE_PHI
          && range_defined_in_block (r, name, bb))
             ^^^^^^^^^^^^^^^^^^^^^^

plus

  gori_compute &g = m_ranger->gori ();
  bitmap exports = g.exports (bb);
  EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));
      if (g.outgoing_edge_range_p (r, e, name, *this))
            ^^^^^^^^^^^^^^^^^^^^^

For this testcase there seem to be a lot of imports but not many exports,
so range_defined_in_block is called very many times compared to
outgoing_edge_range_p, though the latter is more expensive per call.
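
If only the exports matter in the end, one could imagine restricting the
per-block walk up front.  A minimal sketch, reusing only the constructs
from the two snippets above; whether dependencies of the exports would
still get resolved this way is exactly the open question:

  gori_compute &g = m_ranger->gori ();
  bitmap exports = g.exports (bb);
  /* Only fold names that can influence an outgoing range of BB,
     instead of walking all of m_imports.  */
  EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));

      if (gimple_code (SSA_NAME_DEF_STMT (name)) != GIMPLE_PHI
          && range_defined_in_block (r, name, bb))
        /* ... cache R as compute_ranges_in_block does today ...  */;
    }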

For the path query I wonder why we are interested in computing (aka
updating the cache for) anything but the exports?  When we compute the
exports, why is the cache not lazily filled just for the interesting
names?  AFAICS we invalidate all local defs (but even then, why?  We get
to see a def exactly once, so why do we have to even think about clearing
something we should not have seen?)
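
To illustrate what lazily filling the cache just for the interesting
names could mean, a small self-contained sketch, deliberately not using
GCC types (compute_fn stands in for range_defined_in_block and friends):

  #include <unordered_map>
  #include <functional>

  struct lazy_cache
  {
    std::function<int (int)> compute_fn;  // stand-in for the range folder
    std::unordered_map<int, int> cache;   // SSA name -> range, on demand

    int get (int name)
    {
      auto it = cache.find (name);
      if (it != cache.end ())
        return it->second;          // each def is resolved exactly once
      int r = compute_fn (name);    // computed only when actually queried
      cache.emplace (name, r);
      return r;
    }
  };

Names that never feed an export are never passed to get () and thus never
computed nor invalidated.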

That is, in path_range_query::compute_ranges

  while (1)
    {
      basic_block bb = curr_bb ();

      compute_ranges_in_block (bb);
      adjust_for_non_null_uses (bb);

      if (at_exit ())
        break;

      move_next ();
    }

I'd expect only a small portion of the actual compute_ranges_in_block
work to be done for all blocks, with the real resolving work done only
for the block ending the path.  Maybe the backwards threader is just
using the wrong (expensive) API here?  It does

 m_solver->compute_ranges (path, m_imports);
 m_solver->range_of_stmt (r, cond);
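
A cheaper variant might look like the following; this is purely
hypothetical, since exit_bb, the gori handle and the restricted bitmap
are placeholders for something the threader would have to provide, not
an existing API:

  /* Hypothetical: seed the solver only with the imports that are
     also exports of the block ending the path, not with all imports.  */
  auto_bitmap interesting;
  bitmap_and (interesting, m_imports, m_ranger->gori ().exports (exit_bb));
  m_solver->compute_ranges (path, interesting);
  m_solver->range_of_stmt (r, cond);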


--

Btw, I wondered whether path-range-query can handle parts of the path
being a "black box", aka skipping to the immediate dominator instead of
following one of the predecessor edges?  I _think_ analysis-wise this
would be quite straightforward, but of course we'd have to represent this
somehow in the path.  Maybe it works by simply leaving out the
intermediate blocks?  Thus, in

   B
   |\
   A
  / \
 C   D
  \ /
   E
    \

the path would be from B to E, but we don't care whether we go the C or
the D way.  When duplicating the path we'd then duplicate the whole
diamond instead of duplicating only one branch, say A->D, while keeping
the edge A->C to the original block C, which defeats the threading of E
to its successor if we happen to go that way.
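
Representation-wise, leaving out the intermediate blocks might indeed be
enough.  A sketch, assuming the path stays a vec<basic_block> as today
and with path_has_gap_p being a hypothetical helper:

  /* Two consecutive path entries without an edge between them would
     mark a skipped "black box" region (the C/D diamond above); the
     solver would then treat any name defined inside it as VARYING.  */
  static bool
  path_has_gap_p (const vec<basic_block> &path, unsigned i)
  {
    return find_edge (path[i], path[i + 1]) == NULL;
  }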
