On Mon, 30 Aug 2021, Xionghu Luo wrote:
>
>
> On 2021/8/27 15:45, Richard Biener wrote:
> > On Thu, 26 Aug 2021, Xionghu Luo wrote:
> >
> >>
> >>
> >> On 2021/8/24 16:20, Richard Biener wrote:
> >>> On Tue, 24 Aug 2021, Xionghu Luo wrote:
> >>>
> >>>>
> >>>>
> >>>> On 2021/8/19 20:11, Richard Biener wrote:
> >>>>>> - class loop *inn_loop = loop;
> >>>>>>
> >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> >>>>>> {
> >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop,
> >>>>>> @@ sbitmap contains_call)
> >>>>>> to disprove this if possible). */
> >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>>>>> break;
> >>>>>> -
> >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb))
> >>>>>> - break;
> >>>>>> -
> >>>>>> - if (bb->loop_father->header == bb)
> >>>>>> - {
> >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>>>> - break;
> >>>>>> -
> >>>>>> - /* In a loop that is always entered we may proceed
> >>>>>> anyway.
> >>>>>> - But record that we entered it and stop once we leave
> >>>>>> it. */
> >>>>>> - inn_loop = bb->loop_father;
> >>>>>> - }
> >>>>>> }
> >>>>>>
> >>>>>> while (1)
> >>>>> I'm not sure this will work correct (I'm not sure how the existing
> >>>>> code makes it so either...). That said, I can't poke any hole
> >>>>> into the change. What I see is that definitely
> >>>>>
> >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>>> last = bb;
> >>>>>
> >>>>> if (bitmap_bit_p (contains_call, bb->index))
> >>>>> break;
> >>>>>
> >>>>> doesn't work reliably since the DOM ordering will process blocks
> >>>>> A B and C in random order for
> >>>>>
> >>>>> for (;;)
> >>>>> {
> >>>>> if (cond)
> >>>>> {
> >>>>> A: foo ();
> >>>>> }
> >>>>> else B:;
> >>>>> C:;
> >>>>> }
> >>>>>
> >>>>> and thus we can end up setting 'last' to C_before_ processing
> >>>>> 'A' and thus arriving at the call foo () ...
> >>>>>
> >>>>> get_loop_body_in_dom_order does some "special sauce" but not
> >>>>> to address the above problem - but it might be that a subtle
> >>>>> issue like the above is the reason for the inner loop handling.
> >>>>> The inner loop block order does_not_ adhere to this "special sauce",
> >>>>> that is - the "Additionally, if a basic block s dominates
> >>>>> the latch, then only blocks dominated by s are be after it."
> >>>>> guarantee holds for the outer loop latch, not for the inner.
> >>>>>
> >>>>> Digging into the history of fill_always_executed_in_1 doesn't
> >>>>> reveal anything - the inner loop handling has been present
> >>>>> since introduction by Zdenek - but usually Zdenek has a reason
> >>>>> for doing things as he does;)
> >>>>
> >>>> Yes, this is really complicated usage, thanks for point it out. :)
> >>>> I constructed two cases to verify this with inner loop includes "If A;
> >>>> else B; C".
> >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also
> >>>> checks
> >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s
> >>>> latch,
> >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or
> >>>> B contains call;
> >>>
> >>> But it depends on the order of visiting ABC and that's hard to put into
> >>> a testcase since it depends on the order of edges and the processing
> >>> of the dominance computation. ABC are simply unordered with respect
> >>> to a dominator walk.
> >>>
> >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop,
> >>>> the access order is CAB, but 'last' also won’t be updated to C in
> >>>> fill_always_executed_in_1
> >>>> since there is also dominate check, then if A or B contains call, it
> >>>> could break
> >>>> successfully.
> >>>>
> >>>> C won't be set to ALWAYS EXECUTED for both circumstance.
> >>>>
> >>>>>
> >>>>> Note it might be simply a measure against quadratic complexity,
> >>>>> esp. since with your patch we also dive into not always executed
> >>>>> subloops as you remove the
> >>>>>
> >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>>> break;
> >>>>>
> >>>>> check. I suggest to evaluate behavior of the patch on a testcase
> >>>>> like
> >>>>>
> >>>>> void foo (int n, int **k)
> >>>>> {
> >>>>> for (int i = 0; i < n; ++i)
> >>>>> if (k[0][i])
> >>>>> for (int j = 0; j < n; ++j)
> >>>>> if (k[1][j])
> >>>>> for (int l = 0; l < n; ++l)
> >>>>> if (k[2][l])
> >>>>> ...
> >>>>> }
> >>>>
> >>>> Theoretically the complexity is changing from L1(bbs) to
> >>>> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs),
> >>>> so fill_always_executed_in_1's execution time is supposed to be increase
> >>>> from
> >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I
> >>>> also drafted a
> >>>> test case has 73-depth loop function with 25 no-ipa function copies each
> >>>> compiled
> >>>> in lim2 and lim4 dependently. Total execution time of
> >>>> fill_always_executed_in_1 is
> >>>> increased from 32ms to 58ms, almost doubled but not quadratic?
> >>>
> >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still
> >>> O(n^2).
> >>>
> >>>> It seems reasonable to see compiling time getting longer since most bbs
> >>>> are checked
> >>>> more but a MUST to ensure early break correctly in every loop level...
> >>>> Though loop nodes could be huge, loop depth will never be so large in
> >>>> actual code?
> >>>
> >>> The "in practice" argument is almost always defeated by automatic
> >>> program generators ;)
> >>>
> >>>>> I suspect you'll see quadratic behavior with your patch. You
> >>>>> should be at least able to preserve a check like
> >>>>>
> >>>>> /* Do not process not always executed subloops to avoid
> >>>>> quadratic behavior. */
> >>>>> if (bb->loop_father->header == bb
> >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>>> break;
> >>>>>
> >>>>> which is of course not optimistic for cases like
> >>>>>
> >>>>> for (..)
> >>>>> {
> >>>>> if (cond)
> >>>>> for (..)
> >>>>> x = 1; // this is always executed if the inner loop is finite
> >>>>> }
> >>>>>
> >>>>> but we need to have an eye on the complexity of this function. I would
> >>>>> have suggested to do greedy visiting of the loop header successors,
> >>>>> processing further blocks if all entries (ignoring backedges) are
> >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist
> >>>>> is empty proceed to inner loops as the current code does. For
> >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB.
> >>>>> Pseudo-code:
> >>>>>
> >>>>> bitmap_set_bit (worklist, loop-header-bb);
> >>>>> while (!bitmap_empty_p (worklist))
> >>>>> {
> >>>>> bb = pop (worklist);
> >>>>
> >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN?
> >>>
> >>> Ah, sure.
> >>>
> >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop);
> >>>>> if (bitmap_bit_p (contains_call, bb->index))
> >>>>> continue;
> >>>>> FOR_EACH_EDGE (e, ei, bb->succs)
> >>>>> {
> >>>>> if (!flow_bb_inside_loop_p (loop, e->dest))
> >>>>> continue;
> >>>>> if (incoming_count[e->dest->index]-- == 0)
> >>>>> push (worklist, e->dest);
> >>>>> }
> >>>>> }
> >>>>
> >>>> Sorry I don't fully understand your algorithm. worklist should be
> >>>> auto_vec<basic_block> don't support bitmap operations? Is incoming_count
> >>>> the bb's preds count, why need retain it since it it decreased to 0?
> >>>
> >>> the worklist is a auto_bitmap, pop () would be
> >>> bitmap_first_set_bit/clear_bit. One could use a vector with the
> >>> caveat of eventually adding duplicate members to the worklist.
> >>> But as said, it's pseudo-code ;)
> >>
> >> Thanks a lot!
> >>
> >>>
> >>>>>
> >>>>> iterate over inner loops (incoming_count can be retained,
> >>>>> we just force the inner loop header onto the worklist).
> >>>>
> >>>> Is this same to ?
> >>>>
> >>>> for (loop = loop->inner; loop; loop = loop->next)
> >>>> fill_always_executed_in_1 (loop, contains_call)
> >>>
> >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which
> >>> should iterate over loops in PRE order.
> >>
> >> Tried implementation with your pseudo-code, it works like the
> >> previous v2 solution, bb after inner loops could be marked
> >> as ALWAYS EXECUTED. You are really so strong! ;)
> >> Regression tested pass on P8LE.
> >>
> >> Assume,
> >>
> >> A: GCC Base
> >> B: inner loop check removal
> >> C: incoming count
> >>
> >> Execution time of fill_always_executed_in_1 for the 73-depth loop
> >> (with five function copies) is
> >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)?
>
> Sorry didn't get C's execution time accurate enough due to forgot to remove
> dump file code in it.
> C's execution time is 25 ms after code refine. It's even better than A.
> Attached the test case ssa-lim-22.c used for time measurement.
>
> >>
> >>
> >> Bumped the patch to v3 with TODOs:
> >> 1. The time measurement code will be removed if the v3 is OK;
> >> 2. loops_list is not used as my code is not rebased to upstream yet.
> >> 3. Function comments is not updated.
> >>
> >>
> >>
> >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1
> >>
> >>
> >> ALWAYS_EXECUTED_IN is not computed completely for nested loops.
> >> Current design will exit if an inner loop doesn't dominate outer
> >> loop's latch or exit after exiting from inner loop, which
> >> caused early return from outer loop, then ALWAYS EXECUTED blocks after
> >> inner loops are skipped. Another problem is it has to call
> >> get_loop_body_in_dom_order in each recursive call which is also
> >> inefficient.
> >>
> >> This patch does greedy visiting of the loop header successors,
> >> processing further blocks if all entries (ignoring backedges) are
> >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch.
> >> When the worklist is empty proceed to inner loops as the current
> >> code does. For bookkeeping simply keep a to-visit-incoming-edges
> >> counter per BB, and pass it through iteration over inner loops.
> >>
> >> Details discusions:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html
> >>
> >> gcc/ChangeLog:
> >>
> >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove
> >> inn_loop check.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >> ---
> >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++---------
> >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++
> >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++
> >> 3 files changed, 111 insertions(+), 38 deletions(-)
> >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >>
> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> >> index b24bc64f2a7..72be6b6bfac 100644
> >> --- a/gcc/tree-ssa-loop-im.c
> >> +++ b/gcc/tree-ssa-loop-im.c
> >> @@ -3192,39 +3192,52 @@ do_store_motion (void)
> >> blocks that contain a nonpure call. */
> >>
> >> static void
> >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int
> >> *bbi)
> >> {
> >> - basic_block bb = NULL, *bbs, last = NULL;
> >> - unsigned i;
> >> + basic_block bb = NULL;
> >> edge e;
> >> - class loop *inn_loop = loop;
> >>
> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL)
> >> {
> >> - bbs = get_loop_body_in_dom_order (loop);
> >> + auto_bitmap work_set;
> >> + bitmap_clear (work_set);
> >
> > bitmap_clear isn't necessary
>
> Removed.
>
> >
> >> + bitmap_set_bit (work_set, loop->header->index);
> >> + unsigned bb_index;
> >> - for (i = 0; i < loop->num_nodes; i++)
> >> - {
> >> - edge_iterator ei;
> >> - bb = bbs[i];
> >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >> + int *bbd = XNEWVEC (int, array_size);
> >> + bbd = XDUPVEC (int, bbi, array_size);
> >
> > I don't think you need to copy 'bbi' but you can re-use the
> > state from the outer loop processing. Did you run into any
> > issues with that?
>
> Yes. For example, adding a small if-else block to ssa-lim-19.c,
> Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call
> fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0,
> then if fill_always_executed_in_1 is called again for loop 2, it's value is
> not
> reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong.
>
>
> struct X { int i; int j; int k;};
>
> void foo(struct X *x, int n, int l, int m)
> {
> for (int j = 0; j < l; j++) // loop 1
> {
> for (int i = 0; i < n; ++i) // loop 2
> {
> if (m)
> x->j++;
> else
> x->j = m+n+l;
>
> int *p = &x->j; // bb 6
> int tem = *p;
> x->j += tem * i;
> }
> int *r = &x->k;
> int tem2 = *r;
> x->k += tem2 * j;
> }
> }
Hmm, but if the outer loop processing reaches bb 6 then
it should have set it ALWAYS_EXECUTED in loop 1 already?
>
>
> >
> >> + while (!bitmap_empty_p (work_set))
> >> + {
> >> + bb_index = bitmap_first_set_bit (work_set);
> >> + bitmap_clear_bit (work_set, bb_index);
> >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
> >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >> - last = bb;
> >> -
> >> + SET_ALWAYS_EXECUTED_IN (bb, loop);
> >> if (bitmap_bit_p (contains_call, bb->index))
> >> break;
> >
> > I think you want to continue; here (process remaining worklist
> > but not continue greedy walking this block)
>
> Same as above, if use 'continue' instead of 'break', the algorithm
> seems also not work again. If inner loop contains a jump to outmost
> loop, the blocks after the jump block will be set to ALWAYS EXECUTE
> incorrectly.
>
> >
> >> -
> >> + edge_iterator ei;
> >> FOR_EACH_EDGE (e, ei, bb->succs)
> >> {
> >> - /* If there is an exit from this BB. */
> >> if (!flow_bb_inside_loop_p (loop, e->dest))
> >> break;
> >
> > in particular this should keep the outer 'bbi' valid to re-use.
> >
> > But again, you want 'continue;' the greedy walk to other edges.
> > If that's not valid (I'd need to think about this) then with
> > your patch whether we process an edge depends on the order
> > of the edge visit so you'd have to walk successors twice,
> > once to determine whether we can greedily walk any of it
> > and once to actually do the greedy walk.
> >
> > So thinking about it an exit edge is like a not returning call
> > and thus we indeed should not process any outgoing edges of
> > this block.
> >
> >> +
> >> /* Or we enter a possibly non-finite loop. */
> >> if (flow_loop_nested_p (bb->loop_father,
> >> e->dest->loop_father)
> >> && ! finite_loop_p (e->dest->loop_father))
> >> break;
> >
> > I think this is no longer necessary? In any case it would
> > again be 'continue;', see also above.
> >
> > I'm missing skipping of backedges in the walk.
> >
> >> +
> >> + if (bbd[e->dest->index] == 1)
> >> + {
> >> + bitmap_set_bit (work_set, e->dest->index);
> >> + bbd[e->dest->index]--;
> >> + }
> >> + else if (bbd[e->dest->index] > 1)
> >> + bbd[e->dest->index]--;
> >
> > maybe just
> >
> > bbd[e->dest->index]--;
> > if (bbd[e->dest->index] == 0)
> > bitmap_set_bit (work_set, e->dest->index);
>
> They are not equivalent. Former won't decrease if bbd[e->dest->index]
> is 0, but the latter does.
we should be never visiting in this case, so sth indeed doesn't
work as I expected.
> >
> >> }
> >> +
> >> if (e)
> >> break;
> >
> > continue; processing the worklist
> >
> >>
> >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop,
> >> @@ sbitmap contains_call)
> >> to disprove this if possible). */
> >> if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >> break;
> >
> > continue; processing the worklist. I think this has to
> > come early, before processing successors, next to
> > the contains_call processing. We could think of ignoring
> > entering of irreducible regions by tracking exits from them,
> > but I'm not sure we're identifying the regions in a good
> > enough way to allow this.
> >
> > Likewise possibly infinite inner loops can be processed
> > when we track exits from those which would be sth like
> >
> > /* When this is an exit from an inner loop that we cannot
> > prove as finite do not follow this edge. */
> > if (bb->loop_father != loop
> > && loop_exit_edge_p (bb->loop_father, e)
> > && ! finite_loop_p (bb->loop_father))
> > continue;
> >
> >> -
> >> - if (!flow_bb_inside_loop_p (inn_loop, bb))
> >> - break;
> >> -
> >> - if (bb->loop_father->header == bb)
> >> - {
> >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >> - break;
> >> -
> >> - /* In a loop that is always entered we may proceed anyway.
> >> - But record that we entered it and stop once we leave it. */
> >> - inn_loop = bb->loop_father;
> >> - }
> >> - }
> >> -
> >> - while (1)
> >> - {
> >> - SET_ALWAYS_EXECUTED_IN (last, loop);
> >> - if (last == loop->header)
> >> - break;
> >> - last = get_immediate_dominator (CDI_DOMINATORS, last);
> >> }
> >> -
> >> - free (bbs);
> >> + free (bbd);
> >> }
> >>
> >> for (loop = loop->inner; loop; loop = loop->next)
> >> - fill_always_executed_in_1 (loop, contains_call);
> >> + fill_always_executed_in_1 (loop, contains_call, bbi);
> >> }
> >>
> >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
> >> bitmap_set_bit (contains_call, bb->index);
> >> }
> >> + mark_dfs_back_edges ();
> >
> > Please put this to loop_invariant_motion_in_fun right before
> > the call to fill_always_executed_in.
>
> Done.
>
> >
> >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >> + int *bb_incoming_edges = XNEWVEC (int, array_size);
> >
> > There's XCNEWVEC that also zeros the array.
>
> OK, thanks.
>
> >
> >> + memset (bb_incoming_edges, 0, array_size * sizeof (int));
> >> + edge_iterator ei;
> >> + edge e;
> >> +
> >> + FOR_EACH_BB_FN (bb, cfun)
> >> + {
> >> + int len = bb->preds->length ();
> >> + FOR_EACH_EDGE (e, ei, bb->preds)
> >> + if (e->flags & EDGE_DFS_BACK)
> >> + len--;
> >> + bb_incoming_edges[bb->index] = len;
> >> + }
> >
> > it would be nice to combine this with the preceeding loop over
> > all blocks.
>
> Done.
>
> >
> >> + static unsigned long diff = 0;
> >> + struct timeval tv;
> >> + gettimeofday (&tv, NULL);
> >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >> +
> >> + //for (loop : loops_list (cfun, 0))
> >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> >> - fill_always_executed_in_1 (loop, contains_call);
> >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
> >> +
> >> + gettimeofday (&tv, NULL);
> >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >> + diff += end - begin;
> >> + //printf ("%s,diff:%ld\n", current_function_name (), diff);
> >> + free (bb_incoming_edges);
> >
> > Eh, looks very much like work in progress. From the O() complexity
> > point duplicating bb_incoming_edges for each loop makes it quadratic
> > again but as said I think we don't need this duplication.
>
> Sorry I didn't quite follow your above comments about reusing
> bb_incoming_edges,
> we are searching the bbs that dominate loop->latch loop by loop independently,
> if outer loop changes the inner loops bb_incoming_edges, it will make the
> inner
> loop's ALWAYS EXECUTE computation incorrect?
We're only supposed to greedily walk into always excuted blocks. I see
this probably breaks down for conditionally executed inner loops like
for ()
{
if ()
for ();
A;
}
where we want to process the inner loop separately but we still want
to make sure A is marked as always executed in the outer loop.
> Seems duplicating bb_incoming_edges not increases the complexity but the
> recursive call does?
Since the bb_incoming_edges array has a size of N and we copy it
number_of_loop times (which is O(N) bound) the copying effectively
has O(N^2) cost. So yes, it's the iteration over all loops,
whether by recursion or iteration.
I wonder if we can more cleverly handle this by picking up
conditionally executed inner loops on the fly for example
by recording a 'outermost unconditionally executed loop'
for each loop header we visit and using that like
SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
So make this a single greedy CFG walk starting at the outermost
loops. loop exists will need to be processed specially then.
Richard.
> >
> > I'd like to see the recursion removed, did the for (loop : ...)
> > not work out?
>
> The recursion in fill_always_executed_in_1 could be removed since loops_list
> could iterate all inner loops but
> "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)"
> only iterate sibling level loops?
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)