On Mon, 9 Nov 2020, Martin Jambor wrote:

> Hi,
>
> this patch modifies the loop invariant pass so that it can operate
> only on a single requested loop and its sub-loops and ignore the rest
> of the function, much like it currently ignores basic blocks that are
> not in any real loop.  It then invokes it from within the loop
> interchange pass when it successfully swaps two loops.  This avoids
> the non-LTO -Ofast run-time regressions of 410.bwaves and 503.bwaves_r
> (which are 19% and 15% faster than current master on an AMD zen2
> machine) while not introducing a full LIM pass into the pass pipeline.
>
> I have not modified the LIM data structures; this means that the pass
> still contains vectors indexed by loop->num even though only a single
> loop nest is actually processed.  I also did not replace the uses of
> pre_and_rev_post_order_compute_fn with a function that would count a
> postorder only for a given loop.  I can of course do so if the
> approach is otherwise deemed viable.
>
> The patch adds one additional global variable requested_loop to the
> pass and then at various places behaves differently when it is set.  I
> was considering storing the fake root loop into it for normal
> operation, but since this loop often requires special handling anyway,
> I came to the conclusion that the code would actually end up less
> straightforward.
>
> I have bootstrapped and tested the patch on x86_64-linux and a very
> similar one on aarch64-linux.  I have also tested it by modifying the
> tree_ssa_lim function to run loop_invariant_motion_from_loop on each
> real outermost loop in a function and this variant also passed
> bootstrap and all tests, including dump scans, of all languages.
>
> I have built the entire SPEC 2006 FPrate monitoring the activity of
> the LIM pass without and with the patch (on top of commit b642fca1c31
> with which 526.blender_r and 538.imagick_r seemed to be failing) and
> it only examined 0.2% more loops, 0.02% more BBs and an even smaller
> percentage of statements, because it is invoked only in a rather
> special circumstance.  But the patch allows for more such need-based
> uses at hopefully reasonable cost.
>
> Since I do not have much experience with loop optimizers, I expect
> that there will be requests to adjust the patch during the review.
> Still, it fixes a performance regression against GCC 9 and so I hope
> to address the concerns in time to get it into GCC 11.
>
> Thanks,
>
> Martin
>
>
> gcc/ChangeLog:
>
> 2020-11-08  Martin Jambor  <mjam...@suse.cz>
>
> 	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
> 	loop_invariant_motion_from_loop on affected loop nests.
> 	* tree-ssa-loop-im.c (requested_loop): New variable.
> 	(get_topmost_lim_loop): New function.
> 	(outermost_invariant_loop): Use it, cap discovered topmost loop at
> 	requested_loop.
> 	(determine_max_movement): Use get_topmost_lim_loop.
> 	(set_level): Assert that the selected loop is not outside of
> 	requested_loop.
> 	(compute_invariantness): Do not process loops outside of
> 	requested_loop, if non-NULL.
> 	(move_computations_worker): Likewise.
> 	(mark_ref_stored): Stop iteration at requested_loop, if non-NULL.
> 	(mark_ref_loaded): Likewise.
> 	(analyze_memory_references): If non-NULL, only process basic
> 	blocks and loops in requested_loop.  Compute contains_call bitmap.
> 	(do_store_motion): Only process requested_loop if non-NULL.
> 	(fill_always_executed_in): Likewise.  Also accept contains_call as
> 	a parameter rather than computing it.
> 	(tree_ssa_lim_initialize): New parameter which is stored into
> 	requested_loop.  Additional dumping.  Only initialize
> 	bb_loop_postorder for loops within requested_loop, if non-NULL.
> 	(tree_ssa_lim_finalize): Clear requested_loop, additional dumping.
> 	(loop_invariant_motion_from_loop): New function.
> 	(tree_ssa_lim): Move all functionality to
> 	loop_invariant_motion_from_loop, call it.
> 	* tree-ssa-loop-manip.h (loop_invariant_motion_from_loop): Declare.
>
> ---
>  gcc/gimple-loop-interchange.cc |  30 +++++-
>  gcc/tree-ssa-loop-im.c         | 176 ++++++++++++++++++++++++---------
>  gcc/tree-ssa-loop-manip.h      |   2 +
>  3 files changed, 156 insertions(+), 52 deletions(-)
>
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index 1656004ecf0..8c376228779 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -2068,6 +2068,7 @@ pass_linterchange::execute (function *fun)
>      return 0;
>
>    bool changed_p = false;
> +  auto_vec<class loop *, 4> loops_to_lim;
>    class loop *loop;
>    FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>      {
> @@ -2077,7 +2078,11 @@
>        if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>          {
>            tree_loop_interchange loop_interchange (loop_nest);
> -          changed_p |= loop_interchange.interchange (datarefs, ddrs);
> +          if (loop_interchange.interchange (datarefs, ddrs))
> +            {
> +              changed_p = true;
> +              loops_to_lim.safe_push (loop_nest[0]);
> +            }
>          }
>        free_dependence_relations (ddrs);
>        free_data_refs_with_aux (datarefs);
> @@ -2085,8 +2090,27 @@ pass_linterchange::execute (function *fun)
>      }
>
>    if (changed_p)
> -    scev_reset ();
> -  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
> +    {
> +      unsigned todo = TODO_update_ssa_only_virtuals;
> +      unsigned len = loops_to_lim.length ();
> +      for (unsigned i = len - 1; i > 0; i--)
> +        for (int j = i - 1; j >= 0; j--)
> +          if (loops_to_lim[j] == loops_to_lim[i]
> +              || flow_loop_nested_p (loops_to_lim[j], loops_to_lim[i]))

I think this all should never happen so you can remove the pruning
of loops_to_lim.
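If that indeed cannot happen, the whole tail would shrink to something
like the following (an untested sketch of the simplification, reusing
only names that already appear in the patch):

  if (changed_p)
    {
      unsigned todo = TODO_update_ssa_only_virtuals;
      /* Run LIM from the outermost loop of each interchanged nest.  */
      for (unsigned i = 0; i < loops_to_lim.length (); i++)
        todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
      scev_reset ();
      return todo;
    }
  return 0;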
> +            {
> +              loops_to_lim.pop ();
> +              break;
> +            }
> +
> +      len = loops_to_lim.length ();
> +      for (unsigned i = 0; i < len; i++)
> +        todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
> +
> +      scev_reset ();
> +      return todo;
> +    }
> +  else
> +    return 0;
>  }
>
>  } // anon namespace
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 6bb07e133cd..24b541dfb17 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -244,6 +244,11 @@ static struct
>  static bitmap_obstack lim_bitmap_obstack;
>  static obstack mem_ref_obstack;
>
> +/* If LIM has been requested only for a particular loop (and all of the
> +   loops enclosed in it), this variable points to it; otherwise it points
> +   to the dummy loop spanning the whole function.  */
> +static class loop *requested_loop;
> +
>  static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
>  static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
>  static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
> @@ -410,6 +415,21 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>
> +/* Return either the topmost real loop in which LOOP is nested if processing
> +   the whole function, or requested_loop if not.  */
> +
> +static class loop *
> +get_topmost_lim_loop (class loop *loop)
> +{
> +  if (requested_loop)
> +    {
> +      gcc_assert (loop == requested_loop
> +                  || flow_loop_nested_p (requested_loop, loop));
> +      return requested_loop;
> +    }
> +  return superloop_at_depth (loop, 1);
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -424,18 +444,18 @@ outermost_invariant_loop (tree def, class loop *loop)
>    struct lim_aux_data *lim_data;
>
>    if (!def)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>
>    if (TREE_CODE (def) != SSA_NAME)
>      {
>        gcc_assert (is_gimple_min_invariant (def));
> -      return superloop_at_depth (loop, 1);
> +      return get_topmost_lim_loop (loop);
>      }
>
>    def_stmt = SSA_NAME_DEF_STMT (def);
>    def_bb = gimple_bb (def_stmt);
>    if (!def_bb)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>
>    max_loop = find_common_loop (loop, def_bb->loop_father);
>
> @@ -443,6 +463,16 @@ outermost_invariant_loop (tree def, class loop *loop)
>    if (lim_data != NULL && lim_data->max_loop != NULL)
>      max_loop = find_common_loop (max_loop,
>                                   loop_outer (lim_data->max_loop));
> +  if (requested_loop)
> +    {
> +      class loop *req_outer = loop_outer (requested_loop);
> +      if (flow_loop_nested_p (max_loop, req_outer))
> +        max_loop = req_outer;
> +      else
> +        gcc_assert (max_loop == req_outer
> +                    || flow_loop_nested_p (req_outer, max_loop));
> +    }
> +
>    if (max_loop == loop)
>      return NULL;
>    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
>
> @@ -677,7 +707,7 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>    if (must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
> -    level = superloop_at_depth (loop, 1);
> +    level = get_topmost_lim_loop (loop);
>    lim_data->max_loop = level;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>
> @@ -813,6 +843,9 @@ set_level (gimple *stmt, class loop *orig_loop, class loop *level)
>
>    gcc_assert (level == lim_data->max_loop
>                || flow_loop_nested_p (lim_data->max_loop, level));
> +  gcc_assert (!requested_loop
> +              || requested_loop == lim_data->max_loop
> +              || flow_loop_nested_p (requested_loop, level));
>
>    lim_data->tgt_loop = level;
>    FOR_EACH_VEC_ELT (lim_data->depends, i, dep_stmt)

I'm not sure if the above massaging is really required; we can either
limit motion (for example here), or simply go ahead since we already
modify code outside of the requested loop (we insert outside of it,
after all).

> @@ -983,7 +1016,12 @@ compute_invariantness (basic_block bb)
>    class loop *outermost = ALWAYS_EXECUTED_IN (bb);
>    struct lim_aux_data *lim_data;
>
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +        return;

Changing the set of blocks to iterate over is what is really desired
here.  We want region-based algorithms to be O(size-of-region) and the
walking alone makes it O(size-of-function) this way.
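One way to get there (an illustrative, untested sketch only;
get_loop_body_in_dom_order is the existing cfgloop.h helper, and a
dominance-ordered walk still visits definition blocks before use
blocks, which is what the invariantness computation relies on):

  /* Walk only the blocks of the region instead of filtering every
     block of the function.  */
  basic_block *body = get_loop_body_in_dom_order (requested_loop);
  for (unsigned i = 0; i < requested_loop->num_nodes; i++)
    compute_invariantness (body[i]);
  free (body);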
> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return;
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -1122,7 +1160,12 @@ move_computations_worker (basic_block bb)
>    struct lim_aux_data *lim_data;
>    unsigned int todo = 0;
>
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +        return todo;
> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return todo;
>
>    for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
> @@ -1414,7 +1457,9 @@ set_ref_stored_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_stored (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

I guess storing the "root loop" alongside the requested one would
simplify this.  There's also the old idea of splitting the
tree_ssa_lim () operation on the outermost loops in the function, which
is a refactoring that would reduce peak memory use and make your
refactoring a bit more natural, given we'd essentially do

  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
    loop_invariant_motion_from_loop (loop);

then for the main pass, which is also a good test for proper
compile-time behavior of the per-loop operation.

> +  while (loop != stop_loop
>           && set_ref_stored_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1435,7 +1480,9 @@ set_ref_loaded_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_loaded (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;
> +  while (loop != stop_loop
>           && set_ref_loaded_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1619,24 +1666,35 @@ sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_,
>    return bb_loop_postorder[loop1->num] < bb_loop_postorder[loop2->num] ? -1 : 1;
>  }
>
> -/* Gathers memory references in loops.  */
> +/* Gathers memory references in loops.  Set a bit in CONTAINS_CALL
> +   corresponding to index of each basic block which contains a non-pure call
> +   statement.  */
>
>  static void
> -analyze_memory_references (void)
> +analyze_memory_references (sbitmap contains_call)
>  {
>    gimple_stmt_iterator bsi;
>    basic_block bb, *bbs;
>    class loop *loop, *outer;
>    unsigned i, n;
>
> -  /* Collect all basic-blocks in loops and sort them after their
> -     loops postorder.  */
> +  /* Collect all basic-blocks in loops (either in the whole function or just
> +     in the requested loop) and sort them after their loops postorder.  */
>    i = 0;
> -  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    if (bb->loop_father != current_loops->tree_root)
> -      bbs[i++] = bb;
> -  n = i;
> +  if (requested_loop)
> +    {
> +      bbs = get_loop_body (requested_loop);
> +      n = requested_loop->num_nodes;
> +    }
> +  else
> +    {
> +      bbs = XNEWVEC (basic_block,
> +                     n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> +      FOR_EACH_BB_FN (bb, cfun)
> +        if (bb->loop_father != current_loops->tree_root)
> +          bbs[i++] = bb;
> +      n = i;
> +    }
>    gcc_sort_r (bbs, n, sizeof (basic_block), sort_bbs_in_loop_postorder_cmp,
>                bb_loop_postorder);
>
> @@ -1648,7 +1706,11 @@ analyze_memory_references (void)
>      {
>        basic_block bb = bbs[i];
>        for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> -        gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +        {
> +          gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +          if (nonpure_call_p (gsi_stmt (bsi)))
> +            bitmap_set_bit (contains_call, bb->index);
> +        }
>      }
>
>    /* Verify the list of gathered memory references is sorted after their
> @@ -1667,7 +1729,9 @@ analyze_memory_references (void)
>
>    /* Propagate the information about accessed memory references up
>       the loop hierarchy.  */
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

Having that extra "enclosing_loop" global in addition to requested_loop
would simplify all of these.

> +  FOR_EACH_ENCLOSED_LOOP (requested_loop, loop, LI_FROM_INNERMOST)
>      {
>        /* Finalize the overall touched references (including subloops).  */
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[loop->num],
>
>        /* Propagate the information about accessed memory references up
>           the loop hierarchy.  */
>        outer = loop_outer (loop);
> -      if (outer == current_loops->tree_root)
> +      if (outer == stop_loop)
>          continue;
>
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[outer->num],
> @@ -2895,8 +2959,11 @@ do_store_motion (void)
>    class loop *loop;
>    bitmap sm_executed = BITMAP_ALLOC (&lim_bitmap_obstack);
>
> -  for (loop = current_loops->tree_root->inner; loop != NULL;
> -       loop = loop->next)
> -    store_motion_loop (loop, sm_executed);
> +  if (requested_loop)
> +    store_motion_loop (requested_loop, sm_executed);

You should avoid store-motion in the per-loop invariant motion mode
since it requires an SSA update which is inherently O(size-of-function).

That said, in the way it's currently structured I think it's "better"
to export tree_ssa_lim () and call it from interchange if any loop was
interchanged (thus run a full pass but conditional on interchange
done).  You can make it cheaper by adding a flag to tree_ssa_lim
controlling whether to do store-motion (I guess this might be an
interesting user-visible flag as well and a possibility to make select
lim passes cheaper via a pass flag) and not do store-motion from the
interchange call.

I think that's how we should fix the regression; refactoring LIM
properly requires more work that doesn't seem to fit the stage1
deadline.
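That direction could look roughly like this (a sketch only; the
store_motion_p parameter and the call from interchange are
hypothetical, the other names exist in the current sources):

  /* In tree_ssa_lim (), make the store-motion phase optional.  */
  if (store_motion_p)
    {
      /* Store motion inserts loads and stores outside of the loops,
         so virtual operands must be renamed afterwards -- the
         O(size-of-function) part.  */
      do_store_motion ();
      todo |= TODO_update_ssa_only_virtuals;
    }

  /* In pass_linterchange::execute, run the cheaper full pass only
     when some nest was actually interchanged.  */
  if (changed_p)
    todo |= tree_ssa_lim (fun, /*store_motion_p=*/false);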
Thanks,
Richard.

> +  else
> +    for (loop = current_loops->tree_root->inner; loop != NULL;
> +         loop = loop->next)
> +      store_motion_loop (loop, sm_executed);
>
>    BITMAP_FREE (sm_executed);
>  }
> @@ -2982,27 +3049,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>     of its header implies execution of bb.  */
>
>  static void
> -fill_always_executed_in (void)
> +fill_always_executed_in (sbitmap contains_call)
>  {
> -  basic_block bb;
>    class loop *loop;
>
> -  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> -  bitmap_clear (contains_call);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    {
> -      gimple_stmt_iterator gsi;
> -      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -        {
> -          if (nonpure_call_p (gsi_stmt (gsi)))
> -            break;
> -        }
> -
> -      if (!gsi_end_p (gsi))
> -        bitmap_set_bit (contains_call, bb->index);
> -    }
> -
> -  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> +  for (loop = requested_loop ? requested_loop
> +              : current_loops->tree_root->inner;
> +       loop;
> +       loop = loop->next)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>
> @@ -3010,11 +3063,17 @@ fill_always_executed_in (void)
>  /* Compute the global information needed by the loop invariant motion pass.  */
>
>  static void
> -tree_ssa_lim_initialize (void)
> +tree_ssa_lim_initialize (class loop *req_loop)
>  {
>    class loop *loop;
>    unsigned i;
>
> +  gcc_assert (!requested_loop || requested_loop != current_loops->tree_root);
> +  requested_loop = req_loop;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS) && requested_loop)
> +    fprintf (dump_file, "Initializing LIM for loop %d.\n\n", req_loop->num);
> +
>    bitmap_obstack_initialize (&lim_bitmap_obstack);
>    gcc_obstack_init (&mem_ref_obstack);
>    lim_aux_data_map = new hash_map<gimple *, lim_aux_data *>;
> @@ -3051,7 +3110,7 @@ tree_ssa_lim_initialize (void)
>       its postorder index.  */
>    i = 0;
>    bb_loop_postorder = XNEWVEC (unsigned, number_of_loops (cfun));
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  FOR_EACH_ENCLOSED_LOOP (req_loop, loop, LI_FROM_INNERMOST)
>      bb_loop_postorder[loop->num] = i++;
>  }
>
> @@ -3086,23 +3145,32 @@ tree_ssa_lim_finalize (void)
>    free_affine_expand_cache (&memory_accesses.ttae_cache);
>
>    free (bb_loop_postorder);
> +  requested_loop = NULL;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "LIM finished.\n\n");
>  }
>
> -/* Moves invariants from loops.  Only "expensive" invariants are moved out --
> -   i.e. those that are likely to be win regardless of the register pressure.  */
> +/* Moves invariants from LOOP in FUN.  If LOOP is NULL, perform the
> +   transformation on all loops within FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  Return the pass TODO flags that need to be carried
> +   out after the transformation.  */
>
> -static unsigned int
> -tree_ssa_lim (function *fun)
> +unsigned int
> +loop_invariant_motion_from_loop (function *fun, class loop *loop)
>  {
>    unsigned int todo = 0;
>
> -  tree_ssa_lim_initialize ();
> +  tree_ssa_lim_initialize (loop);
>
> +  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> +  bitmap_clear (contains_call);
>    /* Gathers information about memory accesses in the loops.  */
> -  analyze_memory_references ();
> +  analyze_memory_references (contains_call);
>
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> -  fill_always_executed_in ();
> +  fill_always_executed_in (contains_call);
>
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> @@ -3135,6 +3203,16 @@ tree_ssa_lim (function *fun)
>    return todo;
>  }
>
> +/* Moves invariants from all loops in FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  */
> +
> +static unsigned int
> +tree_ssa_lim (function *fun)
> +{
> +  return loop_invariant_motion_from_loop (fun, NULL);
> +}
> +
>  /* Loop invariant motion pass.  */
>
>  namespace {
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index e789e4fcb0b..c89a86aef4e 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -56,6 +56,8 @@ extern void tree_unroll_loop (class loop *, unsigned,
>                                edge, class tree_niter_desc *);
>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
>
> +extern unsigned int loop_invariant_motion_from_loop (function *fun,
> +                                                     class loop *loop);
>
>
>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
> --

Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imend