On Mon, 9 Nov 2020, Martin Jambor wrote:

> Hi,
> 
> this patch modifies the loop invariant pass so that it can operate
> only on a single requested loop and its sub-loops and ignore the rest
> of the function, much like it currently ignores basic blocks that are
> not in any real loop.  The pass is then invoked from within the loop
> interchange pass whenever it successfully swaps two loops.  This avoids
> the non-LTO -Ofast run-time regressions of 410.bwaves and 503.bwaves_r
> (which are 19% and 15% faster than current master on an AMD zen2
> machine) while not introducing a full LIM pass into the pass pipeline.
> 
> I have not modified the LIM data structures, which means they still
> contain vectors indexed by loop->num even though only a single loop
> nest is actually processed.  I also did not replace the uses of
> pre_and_rev_post_order_compute_fn with a function that would compute a
> postorder only for a given loop.  I can of course do so if the
> approach is otherwise deemed viable.
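> Such a per-loop variant could presumably be built on the existing
> get_loop_body_in_dom_order helper, which returns the blocks of a
> single loop with dominators before dominated blocks; just a sketch,
> with a hypothetical process_bb standing in for the actual per-BB work:
> 
>   basic_block *body = get_loop_body_in_dom_order (requested_loop);
>   for (unsigned i = 0; i < requested_loop->num_nodes; i++)
>     process_bb (body[i]);
>   free (body);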
> 
> The patch adds one additional global variable requested_loop to the
> pass and then at various places behaves differently when it is set.  I
> was considering storing the fake root loop into it for normal
> operation, but since this loop often requires special handling anyway,
> I came to the conclusion that the code would actually end up less
> straightforward.
> 
> I have bootstrapped and tested the patch on x86_64-linux and a very
> similar one on aarch64-linux.  I have also tested it by modifying the
> tree_ssa_lim function to run loop_invariant_motion_from_loop on each
> real outermost loop in a function and this variant also passed
> bootstrap and all tests, including dump scans, of all languages.
> 
> I have built the entire SPEC 2006 FPrate suite, monitoring the
> activity of the LIM pass with and without the patch (on top of commit
> b642fca1c31, with which 526.blender_r and 538.imagick_r seemed to be
> failing).  The pass only examined 0.2% more loops, 0.02% more BBs and
> an even smaller percentage of statements, because it is invoked only
> in a rather special circumstance.  But the patch allows for more such
> need-based uses at hopefully reasonable cost.
> 
> Since I do not have much experience with loop optimizers, I expect
> that there will be requests to adjust the patch during the review.
> Still, it fixes a performance regression against GCC 9 and so I hope
> to address the concerns in time to get it into GCC 11.
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2020-11-08  Martin Jambor  <mjam...@suse.cz>
> 
>       * gimple-loop-interchange.cc (pass_linterchange::execute): Call
>       loop_invariant_motion_from_loop on affected loop nests.
>       * tree-ssa-loop-im.c (requested_loop): New variable.
>       (get_topmost_lim_loop): New function.
>       (outermost_invariant_loop): Use it, cap discovered topmost loop at
>       requested_loop.
>       (determine_max_movement): Use get_topmost_lim_loop.
>       (set_level): Assert that the selected loop is not outside of
>       requested_loop.
>       (compute_invariantness): Do not process loops outside of
>       requested_loop, if non-NULL.
>       (move_computations_worker): Likewise.
>       (mark_ref_stored): Stop iteration at requested_loop, if non-NULL.
>       (mark_ref_loaded): Likewise.
>       (analyze_memory_references): If non-NULL, only process basic
>       blocks and loops in requested_loop.  Compute contains_call bitmap.
>       (do_store_motion): Only process requested_loop if non-NULL.
>       (fill_always_executed_in): Likewise.  Also accept contains_call as
>       a parameter rather than computing it.
>       (tree_ssa_lim_initialize): New parameter which is stored into
>       requested_loop.  Additional dumping.  Only initialize
>       bb_loop_postorder for loops within requested_loop, if non-NULL.
>       (tree_ssa_lim_finalize): Clear requested_loop, additional dumping.
>       (loop_invariant_motion_from_loop): New function.
>       (tree_ssa_lim): Move all functionality to
>       loop_invariant_motion_from_loop, call it.
>       * tree-ssa-loop-manip.h (loop_invariant_motion_from_loop): Declare.
> 
> ---
>  gcc/gimple-loop-interchange.cc |  30 +++++-
>  gcc/tree-ssa-loop-im.c         | 176 ++++++++++++++++++++++++---------
>  gcc/tree-ssa-loop-manip.h      |   2 +
>  3 files changed, 156 insertions(+), 52 deletions(-)
> 
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index 1656004ecf0..8c376228779 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -2068,6 +2068,7 @@ pass_linterchange::execute (function *fun)
>      return 0;
>  
>    bool changed_p = false;
> +  auto_vec<class loop *, 4> loops_to_lim;
>    class loop *loop;
>    FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>      {
> @@ -2077,7 +2078,11 @@ pass_linterchange::execute (function *fun)
>        if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>       {
>         tree_loop_interchange loop_interchange (loop_nest);
> -       changed_p |= loop_interchange.interchange (datarefs, ddrs);
> +       if (loop_interchange.interchange (datarefs, ddrs))
> +         {
> +           changed_p = true;
> +           loops_to_lim.safe_push (loop_nest[0]);
> +         }
>       }
>        free_dependence_relations (ddrs);
>        free_data_refs_with_aux (datarefs);
> @@ -2085,8 +2090,27 @@ pass_linterchange::execute (function *fun)
>      }
>  
>    if (changed_p)
> -    scev_reset ();
> -  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
> +    {
> +      unsigned todo = TODO_update_ssa_only_virtuals;
> +      unsigned len  = loops_to_lim.length ();
> +      for (unsigned i = len - 1; i > 0; i--)
> +     for (int j = i - 1; j >= 0; j--)
> +       if (loops_to_lim[j] == loops_to_lim[i]
> +           || flow_loop_nested_p (loops_to_lim[j], loops_to_lim[i]))

I think this all should never happen, so you can remove the pruning
of loops_to_lim.
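The driver would then collapse to just (a sketch reusing the names
from the patch):

  unsigned todo = TODO_update_ssa_only_virtuals;
  for (unsigned i = 0; i < loops_to_lim.length (); i++)
    todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);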

> +         {
> +           loops_to_lim.pop ();
> +           break;
> +         }
> +
> +      len  = loops_to_lim.length ();
> +      for (unsigned i = 0; i < len; i++)
> +     todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
> +
> +      scev_reset ();
> +      return todo;
> +    }
> +  else
> +    return 0;
>  }
>  
>  } // anon namespace
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 6bb07e133cd..24b541dfb17 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -244,6 +244,11 @@ static struct
>  static bitmap_obstack lim_bitmap_obstack;
>  static obstack mem_ref_obstack;
>  
> +/* If LIM has been requested only for a particular loop (and all of the
> +   loops enclosed in it), this variable points to it, otherwise it is NULL
> +   and the pass processes all loops in the function.  */
> +static class loop *requested_loop;
> +
>  static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
>  static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
>  static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
> @@ -410,6 +415,21 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>  
> +/* Return the topmost real loop in which LOOP is nested when processing
> +   the whole function, or requested_loop otherwise.  */
> +
> +static class loop *
> +get_topmost_lim_loop (class loop *loop)
> +{
> +  if (requested_loop)
> +    {
> +      gcc_assert (loop == requested_loop
> +               || flow_loop_nested_p (requested_loop, loop));
> +      return requested_loop;
> +    }
> +  return superloop_at_depth (loop, 1);
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -424,18 +444,18 @@ outermost_invariant_loop (tree def, class loop *loop)
>    struct lim_aux_data *lim_data;
>  
>    if (!def)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>  
>    if (TREE_CODE (def) != SSA_NAME)
>      {
>        gcc_assert (is_gimple_min_invariant (def));
> -      return superloop_at_depth (loop, 1);
> +      return get_topmost_lim_loop (loop);
>      }
>  
>    def_stmt = SSA_NAME_DEF_STMT (def);
>    def_bb = gimple_bb (def_stmt);
>    if (!def_bb)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>  
>    max_loop = find_common_loop (loop, def_bb->loop_father);
>  
> @@ -443,6 +463,16 @@ outermost_invariant_loop (tree def, class loop *loop)
>    if (lim_data != NULL && lim_data->max_loop != NULL)
>      max_loop = find_common_loop (max_loop,
>                                loop_outer (lim_data->max_loop));
> +  if (requested_loop)
> +    {
> +      class loop *req_outer = loop_outer (requested_loop);
> +      if (flow_loop_nested_p (max_loop, req_outer))
> +     max_loop = req_outer;
> +      else
> +     gcc_assert (max_loop == req_outer
> +                 || flow_loop_nested_p (req_outer, max_loop));
> +    }
> +
>    if (max_loop == loop)
>      return NULL;
>    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
> @@ -677,7 +707,7 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>    if (must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
> -    level = superloop_at_depth (loop, 1);
> +    level = get_topmost_lim_loop (loop);
>    lim_data->max_loop = level;
>  
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
> @@ -813,6 +843,9 @@ set_level (gimple *stmt, class loop *orig_loop, class loop *level)
>  
>    gcc_assert (level == lim_data->max_loop
>             || flow_loop_nested_p (lim_data->max_loop, level));
> +  gcc_assert (!requested_loop
> +           || requested_loop == lim_data->max_loop
> +           || flow_loop_nested_p (requested_loop, level));
>  
>    lim_data->tgt_loop = level;
>    FOR_EACH_VEC_ELT (lim_data->depends, i, dep_stmt)

I'm not sure the above massaging is really required; we can either
limit motion at the final stage, for example here, or simply go
ahead, since we already modify code outside of the requested loop (we
insert outside of it, after all).
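Limiting it could be as simple as (just a sketch of the idea):

  if (requested_loop
      && level != requested_loop
      && !flow_loop_nested_p (requested_loop, level))
    level = requested_loop;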

> @@ -983,7 +1016,12 @@ compute_invariantness (basic_block bb)
>    class loop *outermost = ALWAYS_EXECUTED_IN (bb);
>    struct lim_aux_data *lim_data;
>  
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +     return;

Changing the set of blocks iterated over is really desired here: we
want region-based algorithms to be O(size-of-region), and the walking
alone makes this O(size-of-function) as written.
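I.e. drive the walk off the region's own blocks; a sketch using the
existing get_loop_body_in_dom_order helper (so dominators are still
visited before the blocks they dominate):

  basic_block *bbs = get_loop_body_in_dom_order (requested_loop);
  for (unsigned i = 0; i < requested_loop->num_nodes; i++)
    compute_invariantness (bbs[i]);
  free (bbs);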

> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return;
>  
>    if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -1122,7 +1160,12 @@ move_computations_worker (basic_block bb)
>    struct lim_aux_data *lim_data;
>    unsigned int todo = 0;
>  
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +     return todo;
> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return todo;
>  
>    for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
> @@ -1414,7 +1457,9 @@ set_ref_stored_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_stored (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

I guess storing the "root loop" alongside the requested one would
simplify this.  There's also the old idea of splitting the
tree_ssa_lim () operation per outermost loop in the function, a
refactoring that would reduce peak memory use and make your
refactoring a bit more natural, given we'd essentially do

  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
    loop_invariant_motion_from_loop (fun, loop);

in the main pass then, which would also be a good test for proper
compile-time behavior of the per-loop operation.

> +  while (loop != stop_loop
>        && set_ref_stored_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1435,7 +1480,9 @@ set_ref_loaded_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_loaded (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;
> +  while (loop != stop_loop
>        && set_ref_loaded_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1619,24 +1666,35 @@ sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_,
>    return bb_loop_postorder[loop1->num] < bb_loop_postorder[loop2->num] ? -1 : 1;
>  }
>  
> -/* Gathers memory references in loops.  */
> +/* Gathers memory references in loops.  Set a bit in CONTAINS_CALL
> +   corresponding to the index of each basic block which contains a
> +   non-pure call statement.  */
>  
>  static void
> -analyze_memory_references (void)
> +analyze_memory_references (sbitmap contains_call)
>  {
>    gimple_stmt_iterator bsi;
>    basic_block bb, *bbs;
>    class loop *loop, *outer;
>    unsigned i, n;
>  
> -  /* Collect all basic-blocks in loops and sort them after their
> -     loops postorder.  */
> +  /* Collect all basic-blocks in loops (either in the whole function or just
> +     in the requested loop) and sort them after their loops postorder.  */
>    i = 0;
> -  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    if (bb->loop_father != current_loops->tree_root)
> -      bbs[i++] = bb;
> -  n = i;
> +  if (requested_loop)
> +    {
> +      bbs = get_loop_body (requested_loop);
> +      n = requested_loop->num_nodes;
> +    }
> +  else
> +    {
> +      bbs = XNEWVEC (basic_block,
> +                  n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> +      FOR_EACH_BB_FN (bb, cfun)
> +     if (bb->loop_father != current_loops->tree_root)
> +       bbs[i++] = bb;
> +      n = i;
> +    }
>    gcc_sort_r (bbs, n, sizeof (basic_block), sort_bbs_in_loop_postorder_cmp,
>             bb_loop_postorder);
>  
> @@ -1648,7 +1706,11 @@ analyze_memory_references (void)
>      {
>        basic_block bb = bbs[i];
>        for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> -        gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +     {
> +       gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +       if (nonpure_call_p (gsi_stmt (bsi)))
> +         bitmap_set_bit (contains_call, bb->index);
> +     }
>      }
>  
>    /* Verify the list of gathered memory references is sorted after their
> @@ -1667,7 +1729,9 @@ analyze_memory_references (void)
>  
>    /* Propagate the information about accessed memory references up
>       the loop hierarchy.  */
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

Having that extra "enclosing_loop" global in addition to requested_loop
would simplify all of these.
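Sketched, with the second global named as above:

  /* Root of the processed region: loop_outer (requested_loop), or
     current_loops->tree_root when processing the whole function.  */
  static class loop *enclosing_loop;

after which these stop conditions all become plain
loop != enclosing_loop tests.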

> +  FOR_EACH_ENCLOSED_LOOP (requested_loop, loop, LI_FROM_INNERMOST)
>      {
>        /* Finalize the overall touched references (including subloops).  */
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[loop->num],
> @@ -1676,7 +1740,7 @@ analyze_memory_references (void)
>        /* Propagate the information about accessed memory references up
>        the loop hierarchy.  */
>        outer = loop_outer (loop);
> -      if (outer == current_loops->tree_root)
> +      if (outer == stop_loop)
>       continue;
>  
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[outer->num],
> @@ -2895,8 +2959,11 @@ do_store_motion (void)
>    class loop *loop;
>    bitmap sm_executed = BITMAP_ALLOC (&lim_bitmap_obstack);
>  
> -  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> -    store_motion_loop (loop, sm_executed);
> +  if (requested_loop)
> +    store_motion_loop (requested_loop, sm_executed);

You should avoid store-motion in the per-loop invariant motion mode
since it requires a SSA update which is interently O(size-of-function).

That said, the way it's currently structured, I think it's "better"
to export tree_ssa_lim () and call it from interchange if any loop
was interchanged (thus run a full pass, but conditional on an
interchange having been done).  You can make it cheaper by adding a
flag to tree_ssa_lim () controlling whether to do store-motion (I
guess this might be an interesting user-visible flag as well, and a
possibility to make select lim passes cheaper via a pass flag) and
not doing store-motion from the interchange call.  I think that's how
we should fix the regression; refactoring LIM properly requires more
work than seems to fit the stage1 deadline.
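A sketch of that direction (the boolean parameter is hypothetical):

  /* Exported from tree-ssa-loop-im.c:  */
  extern unsigned int tree_ssa_lim (function *fun, bool store_motion_p);

  /* In pass_linterchange::execute:  */
  if (changed_p)
    todo |= tree_ssa_lim (fun, /*store_motion_p=*/false);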

Thanks,
Richard.



> +  else
> +    for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +      store_motion_loop (loop, sm_executed);
>  
>    BITMAP_FREE (sm_executed);
>  }
> @@ -2982,27 +3049,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>     of its header implies execution of bb.  */
>  
>  static void
> -fill_always_executed_in (void)
> +fill_always_executed_in (sbitmap contains_call)
>  {
> -  basic_block bb;
>    class loop *loop;
>  
> -  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> -  bitmap_clear (contains_call);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    {
> -      gimple_stmt_iterator gsi;
> -      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -     {
> -       if (nonpure_call_p (gsi_stmt (gsi)))
> -         break;
> -     }
> -
> -      if (!gsi_end_p (gsi))
> -     bitmap_set_bit (contains_call, bb->index);
> -    }
> -
> -  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> +  for (loop = requested_loop ? requested_loop : current_loops->tree_root->inner;
> +       loop;
> +       loop = loop->next)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>  
> @@ -3010,11 +3063,17 @@ fill_always_executed_in (void)
>  /* Compute the global information needed by the loop invariant motion pass.  */
>  
>  static void
> -tree_ssa_lim_initialize (void)
> +tree_ssa_lim_initialize (class loop *req_loop)
>  {
>    class loop *loop;
>    unsigned i;
>  
> +  gcc_assert (!req_loop || req_loop != current_loops->tree_root);
> +  requested_loop = req_loop;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS) && requested_loop)
> +    fprintf (dump_file, "Initializing LIM for loop %d.\n\n", req_loop->num);
> +
>    bitmap_obstack_initialize (&lim_bitmap_obstack);
>    gcc_obstack_init (&mem_ref_obstack);
>    lim_aux_data_map = new hash_map<gimple *, lim_aux_data *>;
> @@ -3051,7 +3110,7 @@ tree_ssa_lim_initialize (void)
>       its postorder index.  */
>    i = 0;
>    bb_loop_postorder = XNEWVEC (unsigned, number_of_loops (cfun));
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  FOR_EACH_ENCLOSED_LOOP (req_loop, loop, LI_FROM_INNERMOST)
>      bb_loop_postorder[loop->num] = i++;
>  }
>  
> @@ -3086,23 +3145,32 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
>  
>    free (bb_loop_postorder);
> +  requested_loop = NULL;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "LIM finished.\n\n");
>  }
>  
> -/* Moves invariants from loops.  Only "expensive" invariants are moved out --
> -   i.e. those that are likely to be win regardless of the register pressure.  */
> +/* Moves invariants from LOOP in FUN.  If LOOP is NULL, perform the
> +   transformation on all loops within FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  Return the pass TODO flags that need to be carried
> +   out after the transformation.  */
>  
> -static unsigned int
> -tree_ssa_lim (function *fun)
> +unsigned int
> +loop_invariant_motion_from_loop (function *fun, class loop *loop)
>  {
>    unsigned int todo = 0;
>  
> -  tree_ssa_lim_initialize ();
> +  tree_ssa_lim_initialize (loop);
>  
> +  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> +  bitmap_clear (contains_call);
>    /* Gathers information about memory accesses in the loops.  */
> -  analyze_memory_references ();
> +  analyze_memory_references (contains_call);
>  
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> -  fill_always_executed_in ();
> +  fill_always_executed_in (contains_call);
>  
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> @@ -3135,6 +3203,16 @@ tree_ssa_lim (function *fun)
>    return todo;
>  }
>  
> +/* Moves invariants from all loops in FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  */
> +
> +static unsigned int
> +tree_ssa_lim (function *fun)
> +{
> +  return loop_invariant_motion_from_loop (fun, NULL);
> +}
> +
>  /* Loop invariant motion pass.  */
>  
>  namespace {
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index e789e4fcb0b..c89a86aef4e 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -56,6 +56,8 @@ extern void tree_unroll_loop (class loop *, unsigned,
>                             edge, class tree_niter_desc *);
>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
>  
> +extern unsigned int loop_invariant_motion_from_loop (function *fun,
> +                                                  class loop *loop);
>  
>  
>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer
