On Tue, 31 Aug 2021, Xionghu Luo wrote:
>
>
> On 2021/8/30 17:19, Richard Biener wrote:
> >>>> bitmap_set_bit (work_set, loop->header->index);
> >>>> + unsigned bb_index;
> >>>> - for (i = 0; i < loop->num_nodes; i++)
> >>>> - {
> >>>> - edge_iterator ei;
> >>>> - bb = bbs[i];
> >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> + int *bbd = XNEWVEC (int, array_size);
> >>>> + bbd = XDUPVEC (int, bbi, array_size);
> >>> I don't think you need to copy 'bbi' but you can re-use the
> >>> state from the outer loop processing. Did you run into any
> >>> issues with that?
> >> Yes.  For example, after adding a small if-else block to ssa-lim-19.c,
> >> the block "x->j += tem * i;" (bb 6) is always executed for loop 2.  When
> >> fill_always_executed_in_1 is called for loop 1, bbi[6] is decreased from
> >> 2 to 1 to 0; if fill_always_executed_in_1 is then called again for loop 2,
> >> its value is not reset, so bb 6 won't be marked ALWAYS_EXECUTED, which is
> >> wrong.
> >>
> >>
> >> struct X { int i; int j; int k;};
> >>
> >> void foo(struct X *x, int n, int l, int m)
> >> {
> >> for (int j = 0; j < l; j++) // loop 1
> >> {
> >> for (int i = 0; i < n; ++i) // loop 2
> >> {
> >> if (m)
> >> x->j++;
> >> else
> >> x->j = m+n+l;
> >>
> >> int *p = &x->j; // bb 6
> >> int tem = *p;
> >> x->j += tem * i;
> >> }
> >> int *r = &x->k;
> >> int tem2 = *r;
> >> x->k += tem2 * j;
> >> }
> >> }
> > Hmm, but if the outer loop processing reaches bb 6 then
> > it should have set it ALWAYS_EXECUTED in loop 1 already?
>
> But bb 6 is NOT ALWAYS_EXECUTED for loop 1; it is only ALWAYS_EXECUTED for
> loop 2, since reaching it requires n > 0.  Please refer to the attached
> file ssa-lim-19.c.138t.lim2.
>
> ;;
> ;; Loop 1
> ;; header 8, latch 12
> ;; depth 1, outer 0
> ;; nodes: 8 12 7 6 4 5 3 13 11
> ;;
> ;; Loop 2
> ;; header 3, latch 13
> ;; depth 2, outer 1
> ;; nodes: 3 13 6 4 5
> ;; 2 succs { 10 9 }
> ;; 10 succs { 8 }
> ;; 11 succs { 3 }
> ;; 3 succs { 4 5 }
> ;; 4 succs { 6 }
> ;; 5 succs { 6 }
> ;; 6 succs { 13 7 }
> ;; 13 succs { 3 }
> ;; 7 succs { 12 9 }
> ;; 12 succs { 8 }
> ;; 8 succs { 11 7 }
> ;; 9 succs { 1 }
>
> always executed: bb->index:8, loop->num: 1
> always executed: bb->index:7, loop->num: 1
> always executed: bb->index:3, loop->num: 2
> always executed: bb->index:6, loop->num: 2
>
> 8<---------------
> / \ |
> 11 \ |
> / \ |
> 3<--- \ |
> /\ | \ |
> 4 5 | \ |
> \/ | \ |
> 6 | \ |
> |-->13 \ |
> |----------> 7 |
> /\ |
> 9 12---
>
> (gdb) x /15x bbd
> 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000001
> 0x1354c9c0: 0x00000001 0x00000001 0x00000002 0x00000002
> 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000001
> 0x1354c9e0: 0x00000001 0x00000001 0x00000000
>
> For loop 1, our algorithm walks through 8->11->3->4->5->6->7,
> exiting at edge 7->9.  Afterwards the counters look like:
>
> (gdb) x /15x bbd
> 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000000
> 0x1354c9c0: 0x00000000 0x00000000 0x00000000 0x00000000
> 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000000
> 0x1354c9e0: 0x00000001 0x00000000 0x00000000
>
> If we don't reset bbd to the incoming-edge counts by memcpy, bbd[3],
> bbd[4], bbd[5] and bbd[6] are already 0 when we process loop 2, so
> fill_always_executed_in_1 cannot set ALWAYS_EXECUTED correctly for
> loop 2 at bb 3 and bb 6.
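> 
> A minimal sketch of the reset I mean, reusing names from the V3 patch
> (bbd is the per-call scratch array, bb_incoming_edges the function-wide
> counts; assuming work_set is reused and cleared per loop):
> 
>   /* Give each loop a fresh copy of the incoming-edge counts and seed
>      the worklist with the loop header.  */
>   memcpy (bbd, bb_incoming_edges, array_size * sizeof (int));
>   bitmap_clear (work_set);
>   bitmap_set_bit (work_set, loop->header->index);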
>
> >
> >>
> >>>
> >>>> + while (!bitmap_empty_p (work_set))
> >>>> + {
> >>>> + bb_index = bitmap_first_set_bit (work_set);
> >>>> + bitmap_clear_bit (work_set, bb_index);
> >>>> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index);
> >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> - last = bb;
> >>>> -
> >>>> + SET_ALWAYS_EXECUTED_IN (bb, loop);
> >>>> if (bitmap_bit_p (contains_call, bb->index))
> >>>> break;
> >>> I think you want to continue; here (process remaining worklist
> >>> but not continue greedy walking this block)
> >> Same as above: if we use 'continue' instead of 'break', the algorithm
> >> doesn't work either.  If an inner loop contains a jump to the outermost
> >> loop, the blocks after the jump block will be marked ALWAYS_EXECUTED
> >> incorrectly.
> >>
> >>>> -
> >>>> + edge_iterator ei;
> >>>> FOR_EACH_EDGE (e, ei, bb->succs)
> >>>> {
> >>>> - /* If there is an exit from this BB. */
> >>>> if (!flow_bb_inside_loop_p (loop, e->dest))
> >>>> break;
> >>> in particular this should keep the outer 'bbi' valid to re-use.
> >>>
> >>> But again, you want 'continue;' the greedy walk to other edges.
> >>> If that's not valid (I'd need to think about this), then with
> >>> your patch whether we process an edge depends on the order of
> >>> edge visits, so you'd have to walk successors twice:
> >>> once to determine whether we can greedily walk any of it,
> >>> and once to actually do the greedy walk.
> >>>
> >>> So thinking about it, an exit edge is like a non-returning call,
> >>> and thus we indeed should not process any outgoing edges of
> >>> this block.
> >>>
> >>>> +
> >>>> /* Or we enter a possibly non-finite loop. */
> >>>> if (flow_loop_nested_p (bb->loop_father,
> >>>> e->dest->loop_father)
> >>>> && ! finite_loop_p (e->dest->loop_father))
> >>>> break;
> >>> I think this is no longer necessary? In any case it would
> >>> again be 'continue;', see also above.
> >>>
> >>> I'm missing skipping of backedges in the walk.
> >>>
> >>>> +
> >>>> + if (bbd[e->dest->index] == 1)
> >>>> + {
> >>>> + bitmap_set_bit (work_set, e->dest->index);
> >>>> + bbd[e->dest->index]--;
> >>>> + }
> >>>> + else if (bbd[e->dest->index] > 1)
> >>>> + bbd[e->dest->index]--;
> >>> maybe just
> >>>
> >>> bbd[e->dest->index]--;
> >>> if (bbd[e->dest->index] == 0)
> >>> bitmap_set_bit (work_set, e->dest->index);
> >> They are not equivalent.  The former won't decrement if
> >> bbd[e->dest->index] is 0, but the latter does.
> > We should never be visiting that block in this case, so something
> > indeed doesn't work as I expected.
>
> I've verified this won't happen if we duplicate bb_incoming_edges
> in fill_always_executed_in_1, so we could use that replacement.
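> 
> With the counters reset per loop they can never already be zero when we
> reach an edge, so the two branches fold into a single guarded decrement;
> a sketch (not the exact patch text):
> 
>   if (--bbd[e->dest->index] == 0)
>     bitmap_set_bit (work_set, e->dest->index);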
>
> >
> >>>> }
> >>>> +
> >>>> if (e)
> >>>> break;
> >>> continue; processing the worklist
> >>>
> >>>>
> >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
> >>>> to disprove this if possible). */
> >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP)
> >>>> break;
> >>> continue; processing the worklist. I think this has to
> >>> come early, before processing successors, next to
> >>> the contains_call processing. We could think of ignoring
> >>> entering of irreducible regions by tracking exits from them,
> >>> but I'm not sure we're identifying the regions in a good
> >>> enough way to allow this.
> >>>
> >>> Likewise, possibly infinite inner loops can be processed
> >>> when we track exits from those, which would be something like
> >>>
> >>> /* When this is an exit from an inner loop that we cannot
> >>> prove as finite do not follow this edge. */
> >>> if (bb->loop_father != loop
> >>> && loop_exit_edge_p (bb->loop_father, e)
> >>> && ! finite_loop_p (bb->loop_father))
> >>> continue;
> >>>
> >>>> -
> >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb))
> >>>> - break;
> >>>> -
> >>>> - if (bb->loop_father->header == bb)
> >>>> - {
> >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >>>> - break;
> >>>> -
> >>>> - /* In a loop that is always entered we may proceed anyway.
> >>>> - But record that we entered it and stop once we leave
> >>>> it. */
> >>>> - inn_loop = bb->loop_father;
> >>>> - }
> >>>> - }
> >>>> -
> >>>> - while (1)
> >>>> - {
> >>>> - SET_ALWAYS_EXECUTED_IN (last, loop);
> >>>> - if (last == loop->header)
> >>>> - break;
> >>>> - last = get_immediate_dominator (CDI_DOMINATORS, last);
> >>>> }
> >>>> -
> >>>> - free (bbs);
> >>>> + free (bbd);
> >>>> }
> >>>>
> >>>> for (loop = loop->inner; loop; loop = loop->next)
> >>>> - fill_always_executed_in_1 (loop, contains_call);
> >>>> + fill_always_executed_in_1 (loop, contains_call, bbi);
> >>>> }
> >>>>
> >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
> >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void)
> >>>> bitmap_set_bit (contains_call, bb->index);
> >>>> }
> >>>> + mark_dfs_back_edges ();
> >>> Please put this to loop_invariant_motion_in_fun right before
> >>> the call to fill_always_executed_in.
> >> Done.
> >>
> >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1;
> >>>> + int *bb_incoming_edges = XNEWVEC (int, array_size);
> >>> There's XCNEWVEC that also zeros the array.
> >> OK, thanks.
> >>
> >>>> + memset (bb_incoming_edges, 0, array_size * sizeof (int));
> >>>> + edge_iterator ei;
> >>>> + edge e;
> >>>> +
> >>>> + FOR_EACH_BB_FN (bb, cfun)
> >>>> + {
> >>>> + int len = bb->preds->length ();
> >>>> + FOR_EACH_EDGE (e, ei, bb->preds)
> >>>> + if (e->flags & EDGE_DFS_BACK)
> >>>> + len--;
> >>>> + bb_incoming_edges[bb->index] = len;
> >>>> + }
> >>> it would be nice to combine this with the preceding loop over
> >>> all blocks.
> >> Done.
> >>
> >>>> + static unsigned long diff = 0;
> >>>> + struct timeval tv;
> >>>> + gettimeofday (&tv, NULL);
> >>>> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 +
> >>>> tv.tv_usec;
> >>>> +
> >>>> + //for (loop : loops_list (cfun, 0))
> >>>> for (loop = current_loops->tree_root->inner; loop; loop =
> >>>> loop->next)
> >>>> - fill_always_executed_in_1 (loop, contains_call);
> >>>> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges);
> >>>> +
> >>>> + gettimeofday (&tv, NULL);
> >>>> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec;
> >>>> + diff += end - begin;
> >>>> + //printf ("%s,diff:%ld\n", current_function_name (), diff);
> >>>> + free (bb_incoming_edges);
> >>> Eh, looks very much like work in progress.  From the O() complexity
> >>> point of view, duplicating bb_incoming_edges for each loop makes it
> >>> quadratic again, but as said I think we don't need this duplication.
> >> Sorry, I didn't quite follow your above comments about reusing
> >> bb_incoming_edges.  We are searching the bbs that dominate loop->latch
> >> loop by loop, independently; if the outer loop changes the inner loop's
> >> bb_incoming_edges, won't it make the inner loop's ALWAYS_EXECUTED
> >> computation incorrect?
> > We're only supposed to greedily walk into always executed blocks.  I see
> > this probably breaks down for conditionally executed inner loops like
> >
> > for ()
> > {
> > if ()
> > for ();
> > A;
> > }
> >
> > where we want to process the inner loop separately but we still want
> > to make sure A is marked as always executed in the outer loop.
>
> Yes, the problem is that ALWAYS_EXECUTED blocks are not contiguous in the
> CFG, so we have to check every bb and maintain bb_incoming_edges in each
> loop.  If any block branches out to an outer loop, we have to 'break'
> instead of 'continue', as the bbs that follow are no longer
> ALWAYS_EXECUTED.  (See the illustrative testcase below.)
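> 
> For illustration, a hypothetical variant of the earlier testcase (reusing
> struct X from above) where the inner loop jumps out early - the block
> "x->j += i;" dominates the latch, yet it must not be marked
> ALWAYS_EXECUTED, because when m is nonzero it is skipped entirely:
> 
> void bar (struct X *x, int n, int m)
> {
>   for (int i = 0; i < n; ++i)
>     {
>       if (m)
>         goto out;   /* exit edge leaving the loop */
>       x->j += i;    /* dominates the latch, but not always executed */
>     }
>  out:
>   x->k++;
> }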
But we're using a worklist (since the CFG branches) - if we do not
continue to pick items from the worklist then the outcome depends
on the order of the worklist processing.  That can't be correct either
(similarly when walking successor edges).
> >> It seems duplicating bb_incoming_edges does not increase the complexity,
> >> but the recursive call does?
> > Since the bb_incoming_edges array has a size of N and we copy it
> > number-of-loops times (which is bounded by O(N)), the copying
> > effectively has O(N^2) cost.  So yes, it's the iteration over all
> > loops, whether done by recursion or iteration.
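> > (For example, with N = 30000 blocks and a loop tree of ~3000 loops,
> > the per-loop copies alone touch about 3000 * 30000 = ~10^8 ints.)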
>
>
> Actually, the copying is outside of the greedy CFG walk loop.  I guess the
> library memcpy of bb_incoming_edges will be somewhat optimized?  But maybe
> that only helps when most of the values are the same, as with
> copy_by_pieces?
>
> And I modified the time-consuming case with many bbs to simulate extreme
> usage and didn't observe quadratic behavior: it shows an almost linear
> build-time increase for both the base and the V3 patch, but the V3 patch
> generates more accurate ALWAYS_EXECUTED results.  The function profile in
> the perf tool also looks reasonable.  At least, it looks like an
> improvement compared with the base code.
>
> A function with 10K bbs already costs more than 8 minutes to build, so it
> seems duplicating bb_incoming_edges for far fewer than 1000*1000*1000 bbs
> will not be a performance bottleneck?  (Even when most values are
> different, the copying will only really slow down when the
> bb_incoming_edges array is very large, like 1000*1000*1000 entries; if so,
> those are really crazy automatic program generators...)
>
> Base:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615                 123.99                  13.52 ( 11%)  2.16 ( 52%)  15.68 ( 11%)   60M ( 21%)
> 15338+30591                208.98                  24.55 ( 12%)  4.04 ( 52%)  28.57 ( 12%)  121M ( 21%)
> 30314+60543                511.54                  48.96 ( 10%)  7.70 ( 52%)  56.90 ( 10%)  241M ( 21%)
> 
> V3 patch:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615                 121.18                  12.26 ( 10%)  1.99 ( 49%)  14.42 ( 11%)   60M ( 21%)
> 15338+30591                204.99                  24.72 ( 12%)  3.98 ( 52%)  28.85 ( 13%)  121M ( 21%)
> 30314+60543                500.76                  48.26 ( 10%)  8.27 ( 51%)  57.12 ( 11%)  241M ( 21%)
>
>
> Samples: 455K of event 'cycles', Event count (approx.): 426772700334
> Overhead Command Shared Object Symbol
> 7.19% cc1 cc1 [.] bitmap_set_bit
> 5.86% cc1 cc1 [.] get_ref_base_and_extent
> 4.08% cc1 cc1 [.] get_continuation_for_phi
> 3.48% cc1 cc1 [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find
> 3.41% cc1 cc1 [.] dominated_by_p
> 3.28% cc1 cc1 [.] compute_transp
> 2.98% cc1 cc1 [.] bitmap_set_range
> 2.60% cc1 cc1 [.] bitmap_list_insert_element_after
> 2.44% cc1 cc1 [.] stmt_may_clobber_ref_p_1
> 2.27% cc1 cc1 [.] refs_may_alias_p_2
> 2.26% cc1 cc1 [.] hash_table<vn_reference_hasher, false, xcallocator>::fi
> 2.26% cc1 cc1 [.] df_rd_transfer_function
> 2.07% cc1 cc1 [.] bitmap_ior_into
> 1.84% cc1 cc1 [.] find_base_term
> 1.82% cc1 cc1 [.] get_alias_set
> 1.79% cc1 cc1 [.] tree_strip_nop_conversions
> 1.73% cc1 cc1 [.] dfs_enumerate_from
> 1.73% cc1 cc1 [.] reference_alias_ptr_type_1
> 1.53% cc1 cc1 [.] alias_sets_conflict_p
> 1.30% cc1 cc1 [.] c_get_alias_set
> 1.16% cc1 cc1 [.] bitmap_and_into
> 1.12% cc1 cc1 [.] indirect_ref_may_alias_decl_p
> 1.04% cc1 cc1 [.] et_splay
> 0.99% cc1 libc-2.27.so [.] __memset_power8
> 0.95% cc1 cc1 [.] bitmap_ior_and_compl
> 0.86% cc1 cc1 [.] bitmap_copy
> 0.78% cc1 cc1 [.] ao_ref_from_mem
>
> >
> > I wonder if we can more cleverly handle this by picking up
> > conditionally executed inner loops on the fly, for example
> > by recording an 'outermost unconditionally executed loop'
> > for each loop header we visit and using that like
> > SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> >
> > So make this a single greedy CFG walk starting at the outermost
> > loops.  Loop exits will need to be processed specially then.
>
> I'm not sure - do you mean we only check loop headers and exits?  As said
> above, the ALWAYS_EXECUTED block chains are not contiguous; there may be
> blocks in the middle of the loop that are ALWAYS_EXECUTED or that jump
> out of the loop.  But could there only be two chains?  One starting from
> loop->header, and another from loop->exit->src for loops with a single
> exit?
For setting ALWAYS_EXECUTED we rely on dominance.  The greedy walk
should determine whether there are paths in the CFG that skip
execution of the block.  I made the assumption that if there exist
"uninterrupted" paths covering all incoming edges of a block, this
ensures the block is always executed.  And since we only trace such
"always executed" blocks, this holds transitively.
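
As a sketch (hypothetical names, not the actual patch), the successor
processing implied by that invariant would look like:

  /* BB was just proven always executed: consume one incoming edge of
     each forward successor inside the loop; a successor whose forward
     incoming edges are all consumed is always executed as well and
     gets queued.  */
  FOR_EACH_EDGE (e, ei, bb->succs)
    {
      if (e->flags & EDGE_DFS_BACK)
        continue;
      if (!flow_bb_inside_loop_p (loop, e->dest))
        continue;
      if (--incoming[e->dest->index] == 0)
        bitmap_set_bit (work_set, e->dest->index);
    }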
I'll see if I can find some time this week to work on this problem
and maybe also create a testcase for the existing issue with
contains_call and the dominator order of BB processing.
Richard.