> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, November 2, 2021 6:22 PM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>;
> Tamar Christina <tamar.christ...@arm.com>; nd <n...@arm.com>
> Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> vectorization
>
> On Tue, 2 Nov 2021, Richard Sandiford wrote:
>
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >
> > >> > -----Original Message-----
> > >> > From: Richard Biener <rguent...@suse.de>
> > >> > Sent: Tuesday, November 2, 2021 2:24 PM
> > >> > To: Tamar Christina <tamar.christ...@arm.com>
> > >> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > >> > vectorization
> > >> >
> > >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Following my current SVE predicate optimization series, a
> > >> > > problem has presented itself: the way vector masks are
> > >> > > generated for masked operations relies on CSE to share masks
> > >> > > efficiently.
> > >> > >
> > >> > > The issue, however, is that masking is done using the & operator,
> > >> > > and & is associative, so reassoc decides to reassociate the
> > >> > > masked operations.
> > >> >
> > >> > But it does this for the purpose of canonicalization and thus CSE.
> > >>
> > >> Yes, but it turns something like
> > >>
> > >>   (a & b) & mask into a & (b & mask).
> > >>
> > >> When (a & b) is used somewhere else you now lose the CSE. So it's
> > >> actually hurting in this case.
> > >
> > > OK, so that's a known "issue" with reassoc: it doesn't consider
> > > global CSE opportunities, and I guess it pushes 'mask' to leaf if it
> > > is loop carried.
> > >
> > >> >
> > >> > > This makes CSE then unable to CSE an unmasked and a masked
> > >> > > operation, leading to duplicate operations being performed.
> > >> > >
> > >> > > To counter this we want to add an RPO pass over the vectorized
> > >> > > loop body when vectorization succeeds. This makes it then no
> > >> > > longer reliant on the RTL-level CSE.
> > >> > >
> > >> > > I have not added a testcase for this as it requires the changes
> > >> > > in my patch series; however, the entire series relies on this
> > >> > > patch to work, so all the tests there cover it.
> > >> > >
> > >> > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > >> > > x86_64-linux-gnu with no issues.
> > >> > >
> > >> > > Ok for master?
> > >> >
> > >> > You are running VN over _all_ loop bodies rather than only those
> > >> > vectorized.  We loop over vectorized loops earlier for optimizing
> > >> > masked store sequences.  I suppose you could hook in there.  I'll
> > >> > also note that we have pass_pre_slp_scalar_cleanup which
> > >> > eventually runs, plus we have a late FRE.  So I don't understand
> > >> > why it doesn't work to CSE later.
> > >>
> > >> Atm, say you have the conditions a > b, and a > b & a > c.
> > >>
> > >> We generate
> > >>
> > >>   mask1 = (a > b) & loop_mask
> > >>   mask2 = (a > b & a > c) & loop_mask
> > >>
> > >> with the intention that mask1 can be re-used in mask2.
> > >>
> > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask),
> > >>
> > >> which has now unmasked (a > b) in mask2, leaving us unable to
> > >> combine mask1 and mask2.  It doesn't generate incorrect code, just
> > >> inefficient code.
> > >>
> > >> >   for (i = 1; i < number_of_loops (cfun); i++)
> > >> >     {
> > >> >       loop_vec_info loop_vinfo;
> > >> >       bool has_mask_store;
> > >> >
> > >> >       loop = get_loop (cfun, i);
> > >> >       if (!loop || !loop->aux)
> > >> >         continue;
> > >> >       loop_vinfo = (loop_vec_info) loop->aux;
> > >> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > >> >       delete loop_vinfo;
> > >> >       if (has_mask_store
> > >> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> > >> >         optimize_mask_stores (loop);
> > >> >       loop->aux = NULL;
> > >> >     }
> > >>
> > >> Ah thanks, I'll make the changes.
> > >
> > > Note I think that full-blown CSE is a bit overkill just to counter a
> > > deficient reassoc (or VN).  At least it is supposed to be "cheap"
> > > and can be conditionalized on loop masks being used as well.
> >
> > Not sure we should make this conditional on loop masks being used.
> > It seems either that:
> >
> > (a) the vectoriser is supposed to avoid creating code that has folding
> >     or VN opportunities, in which case we need to generate the
> >     vectorised code in a smarter way, or
> >
> > (b) the vectoriser is allowed to create code that has folding or VN
> >     opportunities, in which case it would be good to have a defined
> >     place to get rid of them.
>
> It's certainly (b), and the definitive place to get rid of those is the
> post-loop-optimizer FRE pass.  That just happens to be after a reassoc
> pass, which makes FRE run into the pre-existing issue that we fail to
> capture all (or the best) possible CSE opportunities from separate
> associatable chains.
>
> > I'm just worried that if we make it conditional on loop masks, we
> > could see cases in which non-loop-mask stuff is optimised differently
> > based on whether the loop has masks or not.  E.g. we might get worse
> > code with an unpredicated main loop and a predicated epilogue compared
> > to a predicated main loop.
>
> Sure.
> Note for loop vectorization we can indeed reasonably easily CSE the
> main body, and RPO VN should be O(region size) and cheap enough for
> this case (we could even add an extra cheap entry for single basic
> blocks).  For BB vectorization we have to rely on the full-function
> FRE pass though.
>
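To make the sharing loss described above concrete, here is a small illustrative sketch (plain Python, not GCC code; all names are made up). It models the mask1/mask2 expressions from the thread as trees and counts how many distinct operations a perfect CSE pass would still have to emit, before and after the reassociation that reassoc performs:

```python
# Illustrative sketch only (not GCC code): model GIMPLE-like expressions
# as nested tuples and count how many distinct operations remain after a
# perfect CSE pass, i.e. how many unique non-leaf subtrees exist.

def collect_subtrees(expr, acc):
    """Add every subtree of EXPR to the set ACC."""
    acc.add(expr)
    if isinstance(expr, tuple):
        for child in expr[1:]:
            collect_subtrees(child, acc)

def total_ops(exprs):
    """Distinct operations needed to compute EXPRS with full sharing."""
    nodes = set()
    for e in exprs:
        collect_subtrees(e, nodes)
    return sum(1 for n in nodes if isinstance(n, tuple))

a_gt_b = ('gt', 'a', 'b')
a_gt_c = ('gt', 'a', 'c')

# CSE-friendly form: mask1 = (a > b) & loop_mask and
# mask2 = ((a > b) & loop_mask) & (a > c) -- mask2 can reuse mask1 whole.
before_reassoc = [('and', a_gt_b, 'loop_mask'),
                  ('and', ('and', a_gt_b, 'loop_mask'), a_gt_c)]

# After reassoc: mask2 = (a > b) & ((a > c) & loop_mask); the subtree
# (a > b) & loop_mask no longer appears in mask2, so mask1 is not reused.
after_reassoc = [('and', a_gt_b, 'loop_mask'),
                 ('and', a_gt_b, ('and', a_gt_c, 'loop_mask'))]

print(total_ops(before_reassoc), total_ops(after_reassoc))  # 4 vs 5
```

Under this toy cost model the reassociated form needs one extra & even after perfect CSE, matching the "not incorrect, just inefficient" observation above.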
I've moved the code so it works only on the vectorized loops.

I don't think there needs to be any code change to handle a single BB
efficiently, no?  The walk stops at the loop exit block, so it's already
optimal in that case, unless I misunderstood?

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPO VN
	upon successful vectorization.

--- inline copy of patch ---

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1298,6 +1298,20 @@ vectorize_loops (void)
 	  delete loop_vinfo;
 	  if (has_mask_store
 	      && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
 	    optimize_mask_stores (loop);
+
+	  bitmap exit_bbs;
+	  /* Perform local CSE, this esp. helps because we emit code for
+	     predicates that need to be shared for optimal predicate usage.
+	     However reassoc will re-order them and prevent CSE from working
+	     as it should.  CSE only the loop body, not the entry.  */
+	  exit_bbs = BITMAP_ALLOC (NULL);
+	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+	  bitmap_set_bit (exit_bbs, loop->latch->index);
+
+	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+	  BITMAP_FREE (exit_bbs);
+
 	  loop->aux = NULL;
 	}
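As background for what the added do_rpo_vn call buys inside the loop body, here is a minimal, hypothetical sketch of hash-based local value numbering over a straight-line statement list (illustrative only; GCC's RPO VN is far more general). Identical (op, operands) computations get the same value number, so the duplicated mask computations from the example above collapse onto the first result:

```python
# Minimal sketch of hash-based local value numbering (illustrative only,
# not GCC's implementation): each (op, canonical operands) pair maps to
# the name that first computed it; later duplicates are dropped.

COMMUTATIVE = {'and', 'or'}

def local_value_number(block):
    """BLOCK is a list of (dest, op, operands) statements in order.
    Returns the block with redundant computations removed."""
    expr_to_name = {}   # (op, canonical operands) -> defining name
    canonical = {}      # name -> earliest equivalent name
    out = []
    for dest, op, operands in block:
        vals = tuple(canonical.get(o, o) for o in operands)
        if op in COMMUTATIVE:
            vals = tuple(sorted(vals))   # canonicalize commutative ops
        key = (op, vals)
        if key in expr_to_name:
            canonical[dest] = expr_to_name[key]  # redundant: reuse result
        else:
            expr_to_name[key] = dest
            canonical[dest] = dest
            out.append((dest, op, vals))
    return out

# mask1 = (a > b) & loop_mask, the same mask computed again under another
# name, then mask2 built on top of the duplicate.
block = [('c1',    'gt',  ('a', 'b')),
         ('mask1', 'and', ('c1', 'loop_mask')),
         ('c2',    'gt',  ('a', 'b')),           # duplicate of c1
         ('m',     'and', ('c2', 'loop_mask')),  # duplicate of mask1
         ('c3',    'gt',  ('a', 'c')),
         ('mask2', 'and', ('m', 'c3'))]

for stmt in local_value_number(block):
    print(stmt)
```

The six input statements shrink to four: the duplicate comparison and the duplicate & are dropped, and mask2 is rewritten in terms of mask1, which is exactly the kind of cleanup the patch wants after the vectorizer emits shared predicates.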
rb15007.patch