On Tue, 9 Jan 2024, Tamar Christina wrote:
> > This makes it quadratic in the number of vectorized early exit loops
> > in a function. The vectorizer CFG manipulation operates in a local
> > enough bubble that programmatic updating of dominators should be
> > possible (after all we manage to produce correct SSA form!), the
> > proposed change gets us too far off, to a point where re-computing
> > dominance info is likely cheaper (but no, we shouldn't do this either).
> >
> > Can you instead give manual updating a try again? I think
> > versioning should produce up-to-date dominator info, it's only
> > when you redirect branches during peeling that you'd need
> > adjustments - but IIRC we're never introducing new merges?
> >
> > IIRC we can't wipe dominators during transform since we query them
> > during code generation.  We possibly could perform the CFG
> > manipulations for all vectorized loops first, recompute all
> > dominators once, and only then do code generation for all of them.
> >
> > But then we're doing a loop transform and the exits will ultimately
> > end up in the same place, so the CFG and dominator update is bounded
> > by where the original exits went.
>
> Yeah that's a fair point, the issue is specifically with at_exit. So how
> about:
>
> When we peel at_exit we are moving the new loop to the exit of the previous
> loop. This means that the blocks outside the loop that the previous loop
> used to dominate are no longer dominated by it.
Hmm, indeed. Note this does make the dominator update O(function-size)
and when vectorizing multiple loops in a function this becomes
quadratic.  That's quite unfortunate, so I wonder if we can delay the
update for the parts where we do not need up-to-date dominators during
vectorization (though of course this gets fragile with only partly
correct dominators).
> The new dominators however are hard to predict since if the loop has multiple
> exits and all of them are "early" ones then we always execute the scalar
> loop. In this case the scalar loop can completely dominate the new loop.
>
> If we later have skip_vector then there's an additional skip edge added that
> might change the dominators.
>
> The previous patch would force an update of all blocks reachable from the new
> exits. This one updates *only* blocks that we know the scalar exits
> dominated.
>
> For the examples this reduces the blocks to update from 18 to 3.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu
> with no issues, both normally and with --enable-checking=release --enable-lto
> --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
>
> Ok for master?
See below.
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/113144
> PR tree-optimization/113145
> * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
> Update all BB that the original exits dominated.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/113144
> PR tree-optimization/113145
> * gcc.dg/vect/vect-early-break_94-pr113144.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..903fe7be6621e81db6f29441e4309fa213d027c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
> @@ -0,0 +1,41 @@
> +/* { dg-do compile } */
> +/* { dg-add-options vect_early_break } */
> +/* { dg-require-effective-target vect_early_break } */
> +/* { dg-require-effective-target vect_int } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +
> +long tar_atol256_max, tar_atol256_size, tar_atosl_min;
> +char tar_atol256_s;
> +void __errno_location();
> +
> +
> +inline static long tar_atol256(long min) {
> + char c;
> + int sign;
> + c = tar_atol256_s;
> + sign = c;
> + while (tar_atol256_size) {
> + if (c != sign)
> + return sign ? min : tar_atol256_max;
> + c = tar_atol256_size--;
> + }
> + if ((c & 128) != (sign & 128))
> + return sign ? min : tar_atol256_max;
> + return 0;
> +}
> +
> +inline static long tar_atol(long min) {
> + return tar_atol256(min);
> +}
> +
> +long tar_atosl() {
> + long n = tar_atol(-1);
> + if (tar_atosl_min) {
> + __errno_location();
> + return 0;
> + }
> + if (n > 0)
> + return 0;
> + return n;
> +}
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 76d4979c0b3b374dcaacf6825a95a8714114a63b..9bacaa182a3919cae1cb99dfc5ae4923e1f93376 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -1719,8 +1719,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit,
> /* Now link the alternative exits. */
> if (multiple_exits_p)
> {
> - set_immediate_dominator (CDI_DOMINATORS, new_preheader,
> - main_loop_exit_block);
> for (auto gsi_from = gsi_start_phis (loop->header),
> gsi_to = gsi_start_phis (new_preheader);
> !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> @@ -1776,7 +1774,14 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit,
> {
> update_loop = new_loop;
> for (edge e : get_loop_exit_edges (loop))
> - doms.safe_push (e->dest);
> + {
> + /* Basic blocks that the old loop dominated are now dominated by
> + the new loop and so we have to update those. */
> + for (auto bb : get_all_dominated_blocks (CDI_DOMINATORS, e->src))
> + if (!flow_bb_inside_loop_p (loop, bb))
> + doms.safe_push (bb);
> + doms.safe_push (e->dest);
> + }
I think you'll get duplicate blocks that way.  Maybe simplify this
all by instead doing

  auto doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header);
  for (unsigned i = 0; i < doms.length ();)
    if (flow_bb_inside_loop_p (loop, doms[i]))
      doms.unordered_remove (i);
    else
      ++i;

(note unordered_remove swaps the last element into slot i, so i must
only advance when nothing was removed)?
OK with that change, but really we should see to avoiding this
quadratic behavior :/  It's probably not too bad right now given we
have quite a few restrictions on vectorizing loops with multiple
exits, but I suggest you try an artificial testcase with the "same"
loop repeated N times to see whether the dominance computation creeps
up in the profile.
Richard.