> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, November 2, 2021 6:22 PM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>;
> Tamar Christina <tamar.christ...@arm.com>; nd <n...@arm.com>
> Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> vectorization
>
> On Tue, 2 Nov 2021, Richard Sandiford wrote:
>
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >
> > >> > -----Original Message-----
> > >> > From: Richard Biener <rguent...@suse.de>
> > >> > Sent: Tuesday, November 2, 2021 2:24 PM
> > >> > To: Tamar Christina <tamar.christ...@arm.com>
> > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > >> > vectorization
> > >> >
> > >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Following my current SVE predicate optimization series, a
> > >> > > problem has presented itself: the way vector masks are
> > >> > > generated for masked operations relies on CSE to share masks
> > >> > > efficiently.
> > >> > >
> > >> > > The issue, however, is that masking is done using the & operator,
> > >> > > and & is associative, so reassoc decides to reassociate the
> > >> > > masked operations.
> > >> >
> > >> > But it does this for the purpose of canonicalization and thus CSE.
> > >>
> > >> Yes, but it turns something like
> > >>
> > >>   (a & b) & mask into a & (b & mask).
> > >>
> > >> When (a & b) is used somewhere else you now lose the CSE. So it's
> > >> actually hurting in this case.
> > >
> > > OK, so that's a known "issue" with reassoc: it doesn't consider
> > > global CSE opportunities, and I guess it pushes 'mask' to leaf if it
> > > is loop carried.
> > >
> > >> >
> > >> > > This makes CSE then unable to CSE an unmasked and a masked
> > >> > > operation, leading to duplicate operations being performed.
> > >> > >
> > >> > > To counter this we want to add an RPO pass over the vectorized
> > >> > > loop body when vectorization succeeds. This makes it then no
> > >> > > longer reliant on the RTL-level CSE.
> > >> > >
> > >> > > I have not added a testcase for this as it requires the changes
> > >> > > in my patch series; however, the entire series relies on this
> > >> > > patch to work, so all the tests there cover it.
> > >> > >
> > >> > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > >> > > x86_64-linux-gnu with no issues.
> > >> > >
> > >> > > Ok for master?
> > >> >
> > >> > You are running VN over _all_ loop bodies rather than only those
> > >> > vectorized.  We loop over vectorized loops earlier for optimizing
> > >> > masked store sequences.  I suppose you could hook in there.  I'll
> > >> > also note that we have pass_pre_slp_scalar_cleanup which
> > >> > eventually runs, plus we have a late FRE.  So I don't understand
> > >> > why it doesn't work to CSE later.
> > >>
> > >> Atm, say you have the conditions a > b, and a > b & a > c.
> > >>
> > >> We generate
> > >>
> > >>   mask1 = (a > b) & loop_mask
> > >>   mask2 = (a > b & a > c) & loop_mask
> > >>
> > >> with the intention that mask1 can be re-used in mask2.
> > >>
> > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask),
> > >>
> > >> which has now unmasked (a > b) in mask2, leaving us unable to
> > >> combine mask1 and mask2.  It doesn't generate incorrect code, just
> > >> inefficient code.
> > >>
> > >> >   for (i = 1; i < number_of_loops (cfun); i++)
> > >> >     {
> > >> >       loop_vec_info loop_vinfo;
> > >> >       bool has_mask_store;
> > >> >
> > >> >       loop = get_loop (cfun, i);
> > >> >       if (!loop || !loop->aux)
> > >> >         continue;
> > >> >       loop_vinfo = (loop_vec_info) loop->aux;
> > >> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > >> >       delete loop_vinfo;
> > >> >       if (has_mask_store
> > >> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> > >> >         optimize_mask_stores (loop);
> > >> >       loop->aux = NULL;
> > >> >     }
> > >>
> > >> Ah thanks, I'll make the changes.
> > >
> > > Note I think that full-blown CSE is a bit overkill just to counter a
> > > deficient reassoc (or VN).  At least it is supposed to be "cheap"
> > > and can be conditionalized on loop masks being used as well.
> >
> > Not sure we should make this conditional on loop masks being used.
> > It seems either that:
> >
> > (a) the vectoriser is supposed to avoid creating code that has folding
> >     or VN opportunities, in which case we need to generate the
> >     vectorised code in a smarter way, or
> >
> > (b) the vectoriser is allowed to create code that has folding or VN
> >     opportunities, in which case it would be good to have a defined
> >     place to get rid of them.
>
> It's certainly (b), and the definitive place to get rid of those is the
> post-loop-optimizer FRE pass.  That just happens to be after a reassoc
> pass, which makes FRE run into the pre-existing issue that we fail to
> capture all (or the best) possible CSE opportunities from separate
> associatable chains.
>
> > I'm just worried that if we make it conditional on loop masks, we
> > could see cases in which non-loop-mask stuff is optimised differently
> > based on whether the loop has masks or not.  E.g. we might get worse
> > code with an unpredicated main loop and a predicated epilogue compared
> > to a predicated main loop.
>
> Sure.
> Note for loop vectorization we can indeed reasonably easily CSE the
> main body, and RPO VN should be O(region size) and cheap enough for
> this case (we could even add an extra cheap entry for single basic
> blocks).  For BB vectorization we have to rely on the full-function
> FRE pass though.
>
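To make the sharing loss described above concrete, here is a small illustrative sketch (plain Python, not GCC code; all names are made up). It models the mask1/mask2 expressions from the thread as trees and counts how many distinct operations a perfect CSE pass would still have to emit, before and after the reassociation that reassoc performs:

```python
# Illustrative sketch only (not GCC code): model GIMPLE-like expressions
# as nested tuples and count how many distinct operations remain after a
# perfect CSE pass, i.e. how many unique non-leaf subtrees exist.

def collect_subtrees(expr, acc):
    """Add every subtree of EXPR to the set ACC."""
    acc.add(expr)
    if isinstance(expr, tuple):
        for child in expr[1:]:
            collect_subtrees(child, acc)

def total_ops(exprs):
    """Distinct operations needed to compute EXPRS with full sharing."""
    nodes = set()
    for e in exprs:
        collect_subtrees(e, nodes)
    return sum(1 for n in nodes if isinstance(n, tuple))

a_gt_b = ('gt', 'a', 'b')
a_gt_c = ('gt', 'a', 'c')

# CSE-friendly form: mask1 = (a > b) & loop_mask and
# mask2 = ((a > b) & loop_mask) & (a > c) -- mask2 can reuse mask1 whole.
before_reassoc = [('and', a_gt_b, 'loop_mask'),
                  ('and', ('and', a_gt_b, 'loop_mask'), a_gt_c)]

# After reassoc: mask2 = (a > b) & ((a > c) & loop_mask); the subtree
# (a > b) & loop_mask no longer appears in mask2, so mask1 is not reused.
after_reassoc = [('and', a_gt_b, 'loop_mask'),
                 ('and', a_gt_b, ('and', a_gt_c, 'loop_mask'))]

print(total_ops(before_reassoc), total_ops(after_reassoc))  # 4 vs 5
```

Under this toy cost model the reassociated form needs one extra & even after perfect CSE, matching the "not incorrect, just inefficient" observation above.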
I've moved the code so it works only on the vectorized loops.

I don't think there needs to be any code change to handle a single BB
efficiently, no?  The walk stops at the loop exit block, so it's already
optimal in that case, unless I misunderstood?

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPO VN
	upon successful vectorization.

--- inline copy of patch ---

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1298,6 +1298,20 @@ vectorize_loops (void)
 	  delete loop_vinfo;
 	  if (has_mask_store
 	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
 	    optimize_mask_stores (loop);
+
+	  bitmap exit_bbs;
+	  /* Perform local CSE, this esp. helps because we emit code for
+	     predicates that need to be shared for optimal predicate usage.
+	     However reassoc will re-order them and prevent CSE from working
+	     as it should.  CSE only the loop body, not the entry.  */
+	  exit_bbs = BITMAP_ALLOC (NULL);
+	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+	  bitmap_set_bit (exit_bbs, loop->latch->index);
+
+	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+	  BITMAP_FREE (exit_bbs);
+
 	  loop->aux = NULL;
 	}
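As background for what the added do_rpo_vn call buys inside the loop body, here is a minimal, hypothetical sketch of hash-based local value numbering over a straight-line statement list (illustrative only; GCC's RPO VN is far more general). Identical (op, operands) computations get the same value number, so the duplicated mask computations from the example above collapse onto the first result:

```python
# Minimal sketch of hash-based local value numbering (illustrative only,
# not GCC's implementation): each (op, canonical operands) pair maps to
# the name that first computed it; later duplicates are dropped.

COMMUTATIVE = {'and', 'or'}

def local_value_number(block):
    """BLOCK is a list of (dest, op, operands) statements in order.
    Returns the block with redundant computations removed."""
    expr_to_name = {}   # (op, canonical operands) -> defining name
    canonical = {}      # name -> earliest equivalent name
    out = []
    for dest, op, operands in block:
        vals = tuple(canonical.get(o, o) for o in operands)
        if op in COMMUTATIVE:
            vals = tuple(sorted(vals))   # canonicalize commutative ops
        key = (op, vals)
        if key in expr_to_name:
            canonical[dest] = expr_to_name[key]  # redundant: reuse result
        else:
            expr_to_name[key] = dest
            canonical[dest] = dest
            out.append((dest, op, vals))
    return out

# mask1 = (a > b) & loop_mask, the same mask computed again under another
# name, then mask2 built on top of the duplicate.
block = [('c1',    'gt',  ('a', 'b')),
         ('mask1', 'and', ('c1', 'loop_mask')),
         ('c2',    'gt',  ('a', 'b')),           # duplicate of c1
         ('m',     'and', ('c2', 'loop_mask')),  # duplicate of mask1
         ('c3',    'gt',  ('a', 'c')),
         ('mask2', 'and', ('m', 'c3'))]

for stmt in local_value_number(block):
    print(stmt)
```

The six input statements shrink to four: the duplicate comparison and the duplicate & are dropped, and mask2 is rewritten in terms of mask1, which is exactly the kind of cleanup the patch wants after the vectorizer emits shared predicates.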
rb15007.patch