On Mon, 30 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/27 15:45, Richard Biener wrote: > > On Thu, 26 Aug 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/8/24 16:20, Richard Biener wrote: > >>> On Tue, 24 Aug 2021, Xionghu Luo wrote: > >>> > >>>> > >>>> > >>>> On 2021/8/19 20:11, Richard Biener wrote: > >>>>>> - class loop *inn_loop = loop; > >>>>>> > >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >>>>>> { > >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, > >>>>>> @@ sbitmap contains_call) > >>>>>> to disprove this if possible). */ > >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>>>> break; > >>>>>> - > >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - if (bb->loop_father->header == bb) > >>>>>> - { > >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - /* In a loop that is always entered we may proceed > >>>>>> anyway. > >>>>>> - But record that we entered it and stop once we leave > >>>>>> it. */ > >>>>>> - inn_loop = bb->loop_father; > >>>>>> - } > >>>>>> } > >>>>>> > >>>>>> while (1) > >>>>> I'm not sure this will work correct (I'm not sure how the existing > >>>>> code makes it so either...). That said, I can't poke any hole > >>>>> into the change. What I see is that definitely > >>>>> > >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> last = bb; > >>>>> > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> break; > >>>>> > >>>>> doesn't work reliably since the DOM ordering will process blocks > >>>>> A B and C in random order for > >>>>> > >>>>> for (;;) > >>>>> { > >>>>> if (cond) > >>>>> { > >>>>> A: foo (); > >>>>> } > >>>>> else B:; > >>>>> C:; > >>>>> } > >>>>> > >>>>> and thus we can end up setting 'last' to C_before_ processing > >>>>> 'A' and thus arriving at the call foo () ... > >>>>> > >>>>> get_loop_body_in_dom_order does some "special sauce" but not > >>>>> to address the above problem - but it might be that a subtle > >>>>> issue like the above is the reason for the inner loop handling. > >>>>> The inner loop block order does_not_ adhere to this "special sauce", > >>>>> that is - the "Additionally, if a basic block s dominates > >>>>> the latch, then only blocks dominated by s are be after it." > >>>>> guarantee holds for the outer loop latch, not for the inner. > >>>>> > >>>>> Digging into the history of fill_always_executed_in_1 doesn't > >>>>> reveal anything - the inner loop handling has been present > >>>>> since introduction by Zdenek - but usually Zdenek has a reason > >>>>> for doing things as he does;) > >>>> > >>>> Yes, this is really complicated usage, thanks for point it out. :) > >>>> I constructed two cases to verify this with inner loop includes "If A; > >>>> else B; C". > >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also > >>>> checks > >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s > >>>> latch, > >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or > >>>> B contains call; > >>> > >>> But it depends on the order of visiting ABC and that's hard to put into > >>> a testcase since it depends on the order of edges and the processing > >>> of the dominance computation. ABC are simply unordered with respect > >>> to a dominator walk. > >>> > >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > >>>> the access order is CAB, but 'last' also won’t be updated to C in > >>>> fill_always_executed_in_1 > >>>> since there is also dominate check, then if A or B contains call, it > >>>> could break > >>>> successfully. > >>>> > >>>> C won't be set to ALWAYS EXECUTED for both circumstance. > >>>> > >>>>> > >>>>> Note it might be simply a measure against quadratic complexity, > >>>>> esp. since with your patch we also dive into not always executed > >>>>> subloops as you remove the > >>>>> > >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> check. I suggest to evaluate behavior of the patch on a testcase > >>>>> like > >>>>> > >>>>> void foo (int n, int **k) > >>>>> { > >>>>> for (int i = 0; i < n; ++i) > >>>>> if (k[0][i]) > >>>>> for (int j = 0; j < n; ++j) > >>>>> if (k[1][j]) > >>>>> for (int l = 0; l < n; ++l) > >>>>> if (k[2][l]) > >>>>> ... > >>>>> } > >>>> > >>>> Theoretically the complexity is changing from L1(bbs) to > >>>> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > >>>> so fill_always_executed_in_1's execution time is supposed to be increase > >>>> from > >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I > >>>> also drafted a > >>>> test case has 73-depth loop function with 25 no-ipa function copies each > >>>> compiled > >>>> in lim2 and lim4 dependently. Total execution time of > >>>> fill_always_executed_in_1 is > >>>> increased from 32ms to 58ms, almost doubled but not quadratic? > >>> > >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > >>> O(n^2). > >>> > >>>> It seems reasonable to see compiling time getting longer since most bbs > >>>> are checked > >>>> more but a MUST to ensure early break correctly in every loop level... > >>>> Though loop nodes could be huge, loop depth will never be so large in > >>>> actual code? > >>> > >>> The "in practice" argument is almost always defeated by automatic > >>> program generators ;) > >>> > >>>>> I suspect you'll see quadratic behavior with your patch. You > >>>>> should be at least able to preserve a check like > >>>>> > >>>>> /* Do not process not always executed subloops to avoid > >>>>> quadratic behavior. */ > >>>>> if (bb->loop_father->header == bb > >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> which is of course not optimistic for cases like > >>>>> > >>>>> for (..) > >>>>> { > >>>>> if (cond) > >>>>> for (..) > >>>>> x = 1; // this is always executed if the inner loop is finite > >>>>> } > >>>>> > >>>>> but we need to have an eye on the complexity of this function. I would > >>>>> have suggested to do greedy visiting of the loop header successors, > >>>>> processing further blocks if all entries (ignoring backedges) are > >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > >>>>> is empty proceed to inner loops as the current code does. For > >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. > >>>>> Pseudo-code: > >>>>> > >>>>> bitmap_set_bit (worklist, loop-header-bb); > >>>>> while (!bitmap_empty_p (worklist)) > >>>>> { > >>>>> bb = pop (worklist); > >>>> > >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > >>> > >>> Ah, sure. > >>> > >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop); > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> continue; > >>>>> FOR_EACH_EDGE (e, ei, bb->succs) > >>>>> { > >>>>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>>>> continue; > >>>>> if (incoming_count[e->dest->index]-- == 0) > >>>>> push (worklist, e->dest); > >>>>> } > >>>>> } > >>>> > >>>> Sorry I don't fully understand your algorithm. worklist should be > >>>> auto_vec<basic_block> don't support bitmap operations? Is incoming_count > >>>> the bb's preds count, why need retain it since it it decreased to 0? > >>> > >>> the worklist is a auto_bitmap, pop () would be > >>> bitmap_first_set_bit/clear_bit. One could use a vector with the > >>> caveat of eventually adding duplicate members to the worklist. > >>> But as said, it's pseudo-code ;) > >> > >> Thanks a lot! > >> > >>> > >>>>> > >>>>> iterate over inner loops (incoming_count can be retained, > >>>>> we just force the inner loop header onto the worklist). > >>>> > >>>> Is this same to ? > >>>> > >>>> for (loop = loop->inner; loop; loop = loop->next) > >>>> fill_always_executed_in_1 (loop, contains_call) > >>> > >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > >>> should iterate over loops in PRE order. > >> > >> Tried implementation with your pseudo-code, it works like the > >> previous v2 solution, bb after inner loops could be marked > >> as ALWAYS EXECUTED. You are really so strong! ;) > >> Regression tested pass on P8LE. > >> > >> Assume, > >> > >> A: GCC Base > >> B: inner loop check removal > >> C: incoming count > >> > >> Execution time of fill_always_executed_in_1 for the 73-depth loop > >> (with five function copies) is > >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? > > Sorry didn't get C's execution time accurate enough due to forgot to remove > dump file code in it. > C's execution time is 25 ms after code refine. It's even better than A. > Attached the test case ssa-lim-22.c used for time measurement. > > >> > >> > >> Bumped the patch to v3 with TODOs: > >> 1. The time measurement code will be removed if the v3 is OK; > >> 2. loops_list is not used as my code is not rebased to upstream yet. > >> 3. Function comments is not updated. > >> > >> > >> > >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 > >> > >> > >> ALWAYS_EXECUTED_IN is not computed completely for nested loops. > >> Current design will exit if an inner loop doesn't dominate outer > >> loop's latch or exit after exiting from inner loop, which > >> caused early return from outer loop, then ALWAYS EXECUTED blocks after > >> inner loops are skipped. Another problem is it has to call > >> get_loop_body_in_dom_order in each recursive call which is also > >> inefficient. > >> > >> This patch does greedy visiting of the loop header successors, > >> processing further blocks if all entries (ignoring backedges) are > >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. > >> When the worklist is empty proceed to inner loops as the current > >> code does. For bookkeeping simply keep a to-visit-incoming-edges > >> counter per BB, and pass it through iteration over inner loops. > >> > >> Details discusions: > >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html > >> > >> gcc/ChangeLog: > >> > >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > >> inn_loop check. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ > >> 3 files changed, 111 insertions(+), 38 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index b24bc64f2a7..72be6b6bfac 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -3192,39 +3192,52 @@ do_store_motion (void) > >> blocks that contain a nonpure call. */ > >> > >> static void > >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int > >> *bbi) > >> { > >> - basic_block bb = NULL, *bbs, last = NULL; > >> - unsigned i; > >> + basic_block bb = NULL; > >> edge e; > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> - bbs = get_loop_body_in_dom_order (loop); > >> + auto_bitmap work_set; > >> + bitmap_clear (work_set); > > > > bitmap_clear isn't necessary > > Removed. > > > > >> + bitmap_set_bit (work_set, loop->header->index); > >> + unsigned bb_index; > >> - for (i = 0; i < loop->num_nodes; i++) > >> - { > >> - edge_iterator ei; > >> - bb = bbs[i]; > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bbd = XNEWVEC (int, array_size); > >> + bbd = XDUPVEC (int, bbi, array_size); > > > > I don't think you need to copy 'bbi' but you can re-use the > > state from the outer loop processing. Did you run into any > > issues with that? > > Yes. For example, adding a small if-else block to ssa-lim-19.c, > Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call > fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, > then if fill_always_executed_in_1 is called again for loop 2, it's value is > not > reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. > > > struct X { int i; int j; int k;}; > > void foo(struct X *x, int n, int l, int m) > { > for (int j = 0; j < l; j++) // loop 1 > { > for (int i = 0; i < n; ++i) // loop 2 > { > if (m) > x->j++; > else > x->j = m+n+l; > > int *p = &x->j; // bb 6 > int tem = *p; > x->j += tem * i; > } > int *r = &x->k; > int tem2 = *r; > x->k += tem2 * j; > } > }
Hmm, but if the outer loop processing reaches bb 6 then it should have set it ALWAYS_EXECUTED in loop 1 already? > > > > > >> + while (!bitmap_empty_p (work_set)) > >> + { > >> + bb_index = bitmap_first_set_bit (work_set); > >> + bitmap_clear_bit (work_set, bb_index); > >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - last = bb; > >> - > >> + SET_ALWAYS_EXECUTED_IN (bb, loop); > >> if (bitmap_bit_p (contains_call, bb->index)) > >> break; > > > > I think you want to continue; here (process remaining worklist > > but not continue greedy walking this block) > > Same as above, if use 'continue' instead of 'break', the algorithm > seems also not work again. If inner loop contains a jump to outmost > loop, the blocks after the jump block will be set to ALWAYS EXECUTE > incorrectly. > > > > >> - > >> + edge_iterator ei; > >> FOR_EACH_EDGE (e, ei, bb->succs) > >> { > >> - /* If there is an exit from this BB. */ > >> if (!flow_bb_inside_loop_p (loop, e->dest)) > >> break; > > > > in particular this should keep the outer 'bbi' valid to re-use. > > > > But again, you want 'continue;' the greedy walk to other edges. > > If that's not valid (I'd need to think about this) then with > > your patch whether we process an edge depends on the order > > of the edge visit so you'd have to walk successors twice, > > once to determine whether we can greedily walk any of it > > and once to actually do the greedy walk. > > > > So thinking about it an exit edge is like a not returning call > > and thus we indeed should not process any outgoing edges of > > this block. > > > >> + > >> /* Or we enter a possibly non-finite loop. */ > >> if (flow_loop_nested_p (bb->loop_father, > >> e->dest->loop_father) > >> && ! finite_loop_p (e->dest->loop_father)) > >> break; > > > > I think this is no longer necessary? In any case it would > > again be 'continue;', see also above. > > > > I'm missing skipping of backedges in the walk. > > > >> + > >> + if (bbd[e->dest->index] == 1) > >> + { > >> + bitmap_set_bit (work_set, e->dest->index); > >> + bbd[e->dest->index]--; > >> + } > >> + else if (bbd[e->dest->index] > 1) > >> + bbd[e->dest->index]--; > > > > maybe just > > > > bbd[e->dest->index]--; > > if (bbd[e->dest->index] == 0) > > bitmap_set_bit (work_set, e->dest->index); > > They are not equivalent. Former won't decrease if bbd[e->dest->index] > is 0, but the latter does. we should be never visiting in this case, so sth indeed doesn't work as I expected. > > > >> } > >> + > >> if (e) > >> break; > > > > continue; processing the worklist > > > >> > >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, > >> @@ sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > > > > continue; processing the worklist. I think this has to > > come early, before processing successors, next to > > the contains_call processing. We could think of ignoring > > entering of irreducible regions by tracking exits from them, > > but I'm not sure we're identifying the regions in a good > > enough way to allow this. > > > > Likewise possibly infinite inner loops can be processed > > when we track exits from those which would be sth like > > > > /* When this is an exit from an inner loop that we cannot > > prove as finite do not follow this edge. */ > > if (bb->loop_father != loop > > && loop_exit_edge_p (bb->loop_father, e) > > && ! finite_loop_p (bb->loop_father)) > > continue; > > > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. */ > >> - inn_loop = bb->loop_father; > >> - } > >> - } > >> - > >> - while (1) > >> - { > >> - SET_ALWAYS_EXECUTED_IN (last, loop); > >> - if (last == loop->header) > >> - break; > >> - last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> - > >> - free (bbs); > >> + free (bbd); > >> } > >> > >> for (loop = loop->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bbi); > >> } > >> > >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > >> bitmap_set_bit (contains_call, bb->index); > >> } > >> + mark_dfs_back_edges (); > > > > Please put this to loop_invariant_motion_in_fun right before > > the call to fill_always_executed_in. > > Done. > > > > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bb_incoming_edges = XNEWVEC (int, array_size); > > > > There's XCNEWVEC that also zeros the array. > > OK, thanks. > > > > >> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > >> + edge_iterator ei; > >> + edge e; > >> + > >> + FOR_EACH_BB_FN (bb, cfun) > >> + { > >> + int len = bb->preds->length (); > >> + FOR_EACH_EDGE (e, ei, bb->preds) > >> + if (e->flags & EDGE_DFS_BACK) > >> + len--; > >> + bb_incoming_edges[bb->index] = len; > >> + } > > > > it would be nice to combine this with the preceeding loop over > > all blocks. > > Done. > > > > >> + static unsigned long diff = 0; > >> + struct timeval tv; > >> + gettimeofday (&tv, NULL); > >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + > >> + //for (loop : loops_list (cfun, 0)) > >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > >> + > >> + gettimeofday (&tv, NULL); > >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + diff += end - begin; > >> + //printf ("%s,diff:%ld\n", current_function_name (), diff); > >> + free (bb_incoming_edges); > > > > Eh, looks very much like work in progress. From the O() complexity > > point duplicating bb_incoming_edges for each loop makes it quadratic > > again but as said I think we don't need this duplication. > > Sorry I didn't quite follow your above comments about reusing > bb_incoming_edges, > we are searching the bbs that dominate loop->latch loop by loop independently, > if outer loop changes the inner loops bb_incoming_edges, it will make the > inner > loop's ALWAYS EXECUTE computation incorrect? We're only supposed to greedily walk into always excuted blocks. I see this probably breaks down for conditionally executed inner loops like for () { if () for (); A; } where we want to process the inner loop separately but we still want to make sure A is marked as always executed in the outer loop. > Seems duplicating bb_incoming_edges not increases the complexity but the > recursive call does? Since the bb_incoming_edges array has a size of N and we copy it number_of_loop times (which is O(N) bound) the copying effectively has O(N^2) cost. So yes, it's the iteration over all loops, whether by recursion or iteration. I wonder if we can more cleverly handle this by picking up conditionally executed inner loops on the fly for example by recording a 'outermost unconditionally executed loop' for each loop header we visit and using that like SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father)) So make this a single greedy CFG walk starting at the outermost loops. loop exists will need to be processed specially then. Richard. > > > > I'd like to see the recursion removed, did the for (loop : ...) > > not work out? > > The recursion in fill_always_executed_in_1 could be removed since loops_list > could iterate all inner loops but > "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)" > only iterate sibling level loops? > > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)