On Fri, 5 Nov 2021, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, November 2, 2021 6:22 PM
> > To: Richard Sandiford <richard.sandif...@arm.com>
> > Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> > Christina <tamar.christ...@arm.com>; nd <n...@arm.com>
> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > vectorization
> >
> > On Tue, 2 Nov 2021, Richard Sandiford wrote:
> >
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > > >
> > > >> > -----Original Message-----
> > > >> > From: Richard Biener <rguent...@suse.de>
> > > >> > Sent: Tuesday, November 2, 2021 2:24 PM
> > > >> > To: Tamar Christina <tamar.christ...@arm.com>
> > > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > > >> > vectorization
> > > >> >
> > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > > >> >
> > > >> > > Hi All,
> > > >> > >
> > > >> > > Following my current SVE predicate optimization series, a
> > > >> > > problem has presented itself in that the way vector masks are
> > > >> > > generated for masked operations relies on CSE to share masks
> > > >> > > efficiently.
> > > >> > >
> > > >> > > The issue, however, is that masking is done using the &
> > > >> > > operator, and since & is associative, reassoc decides to
> > > >> > > reassociate the masked operations.
> > > >> >
> > > >> > But it does this for the purpose of canonicalization and thus CSE.
> > > >>
> > > >> Yes, but it turns something like
> > > >>
> > > >>   (a & b) & mask
> > > >>
> > > >> into
> > > >>
> > > >>   a & (b & mask).
> > > >>
> > > >> When (a & b) is used somewhere else you now lose the CSE.  So it's
> > > >> actually hurting in this case.
> > > >
> > > > OK, so that's a known "issue" with reassoc: it doesn't consider
> > > > global CSE opportunities, and I guess it pushes 'mask' to a leaf if
> > > > it is loop-carried.
> > > >
> > > >> >
> > > >> > > This makes CSE then unable to CSE an unmasked and a masked
> > > >> > > operation, leading to duplicate operations being performed.
> > > >> > >
> > > >> > > To counter this we want to add an RPO VN pass over the
> > > >> > > vectorized loop body when vectorization succeeds.  This makes
> > > >> > > it no longer reliant on the RTL-level CSE.
> > > >> > >
> > > >> > > I have not added a testcase for this as it requires the changes
> > > >> > > in my patch series; however, the entire series relies on this
> > > >> > > patch to work, so all the tests there cover it.
> > > >> > >
> > > >> > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > > >> > > x86_64-linux-gnu with no issues.
> > > >> > >
> > > >> > > Ok for master?
> > > >> >
> > > >> > You are running VN over _all_ loop bodies rather than only those
> > > >> > vectorized.  We loop over vectorized loops earlier for optimizing
> > > >> > masked store sequences.  I suppose you could hook in there.  I'll
> > > >> > also note that we have pass_pre_slp_scalar_cleanup, which
> > > >> > eventually runs, plus we have a late FRE.  So I don't understand
> > > >> > why it doesn't work to CSE later.
> > > >> >
> > > >>
> > > >> Atm, say you have the conditions a > b and a > b & a > c.
> > > >>
> > > >> We generate
> > > >>
> > > >>   mask1 = (a > b) & loop_mask
> > > >>   mask2 = (a > b & a > c) & loop_mask
> > > >>
> > > >> with the intention that mask1 can be re-used in mask2.
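> > > >>
> > > >> (As an illustrative GIMPLE-style sketch, with made-up names: the
> > > >> CSE we are after would rewrite this to
> > > >>
> > > >>   mask1 = (a > b) & loop_mask
> > > >>   mask2 = mask1 & (a > c)
> > > >>
> > > >> so the a > b comparison and its masking are computed only once.)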
> > > >>
> > > >> Reassoc changes this to
> > > >>
> > > >>   mask2 = a > b & (a > c & loop_mask)
> > > >>
> > > >> which has now unmasked (a > b) in mask2 and leaves us unable to
> > > >> combine mask1 and mask2.  It doesn't generate incorrect code, just
> > > >> inefficient code.
> > > >>
> > > >> >       for (i = 1; i < number_of_loops (cfun); i++)
> > > >> >         {
> > > >> >           loop_vec_info loop_vinfo;
> > > >> >           bool has_mask_store;
> > > >> >
> > > >> >           loop = get_loop (cfun, i);
> > > >> >           if (!loop || !loop->aux)
> > > >> >             continue;
> > > >> >           loop_vinfo = (loop_vec_info) loop->aux;
> > > >> >           has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > > >> >           delete loop_vinfo;
> > > >> >           if (has_mask_store
> > > >> >               && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> > > >> >             optimize_mask_stores (loop);
> > > >> >           loop->aux = NULL;
> > > >> >         }
> > > >> >
> > > >>
> > > >> Ah thanks, I'll make the changes.
> > > >
> > > > Note I think that full-blown CSE is a bit overkill just to counter a
> > > > deficient reassoc (or VN).  At least it is supposed to be "cheap",
> > > > and it can be conditionalized on loop masks being used as well.
> > >
> > > Not sure we should make this conditional on loop masks being used.
> > > It seems that either:
> > >
> > > (a) the vectoriser is supposed to avoid creating code that has folding
> > >     or VN opportunities, in which case we need to generate the
> > >     vectorised code in a smarter way; or
> > >
> > > (b) the vectoriser is allowed to create code that has folding or VN
> > >     opportunities, in which case it would be good to have a defined
> > >     place to get rid of them.
> >
> > It's certainly (b), and the definitive place to get rid of those is the
> > post-loop-optimizer FRE pass.  That just happens to run after a reassoc
> > pass, which makes FRE run into the pre-existing issue that we fail to
> > capture all (or the best) possible CSE opportunities from separate
> > associatable chains.
> >
> > > I'm just worried that if we make it conditional on loop masks, we
> > > could see cases in which non-loop-mask stuff is optimised differently
> > > based on whether the loop has masks or not.  E.g. we might get worse
> > > code with an unpredicated main loop and a predicated epilogue
> > > compared to a predicated main loop.
> >
> > Sure.  Note that for loop vectorization we can indeed reasonably easily
> > CSE the main body, and RPO VN should be O(region size) and cheap enough
> > for this case (we could even add an extra cheap entry for single
> > basic blocks).  For BB vectorization we have to rely on the
> > full-function FRE pass, though.
>
> I've moved the code so it works only on the vector loops.  I don't think
> there needs to be any code change to handle a single BB efficiently, no?
> The walk stops at the loop exit block, so it's already optimal in that
> case, unless I misunderstood?
Yeah, it's at most some constant-time work we could avoid.

> Bootstrapped and regtested on aarch64-none-linux-gnu and
> x86_64-linux-gnu with no issues.
>
> Ok for master?

See below ...

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPO VN
> 	upon successful vectorization.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
>
> @@ -1298,6 +1298,20 @@ vectorize_loops (void)
>        if (has_mask_store
>            && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>          optimize_mask_stores (loop);
> +
> +      bitmap exit_bbs;
> +      /* Perform local CSE, this esp. helps because we emit code for
> +         predicates that need to be shared for optimal predicate usage.
> +         However reassoc will re-order them and prevent CSE from working
> +         as it should.  CSE only the loop body, not the entry.  */
> +      exit_bbs = BITMAP_ALLOC (NULL);

Use

  auto_bitmap exit_bbs;

and the allocation and ...

> +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +      bitmap_set_bit (exit_bbs, loop->latch->index);

Treating the latch as an exit is probably premature optimization (yes,
it's empty).

> +
> +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> +      BITMAP_FREE (exit_bbs);

... the deallocation can go.

Note I wonder whether, if we are already spinning up VN, we should
include the preheader in the operation?  We regularly end up emitting
redundant vector initializers that could be cleaned up earlier this way.

Otherwise the change looks OK.

Thanks,
Richard.
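
For reference, a revised version of the new hunk with the above review
comments folded in might look roughly like the sketch below.  This is
only a sketch against the posted patch, not a committed version; it
still enters VN at the preheader edge (i.e. without including the
preheader), leaving the preheader question open.

  /* Perform local CSE, this esp. helps because we emit code for
     predicates that need to be shared for optimal predicate usage.
     However reassoc will re-order them and prevent CSE from working
     as it should.  CSE only the loop body, not the entry.  */
  /* auto_bitmap releases its storage on scope exit, so the explicit
     BITMAP_ALLOC / BITMAP_FREE pair from the posted patch is gone.  */
  auto_bitmap exit_bbs;
  /* Only the single exit destination terminates the RPO walk; per the
     review, the (empty) latch no longer gets its own bit.  */
  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);

  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);

As in the posted patch, this sits in the per-loop cleanup loop of
vectorize_loops, after the optimize_mask_stores call.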