On Tue, 2 Nov 2021, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Tuesday, November 2, 2021 2:24 PM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > Subject: Re: [PATCH]middle-end Add an RPO pass after successful > > vectorization > > > > On Tue, 2 Nov 2021, Tamar Christina wrote: > > > > > Hi All, > > > > > > Following my current SVE predicate optimization series a problem has > > > presented itself in that the way vector masks are generated for masked > > > operations relies on CSE to share masks efficiently. > > > > > > The issue however is that masking is done using the & operand and & is > > > associative and so reassoc decides to reassociate the masked operations. > > > > But it does this for the purpose of canonicalization and thus CSE. > > Yes, but it turns something like > > (a & b) & mask into a & (b & mask). > > When (a & b) is used somewhere else you now lose the CSE. So it's actually > hurting > In this case.
OK, so that's a known "issue" with reassoc, it doesn't consider global CSE opportunities and I guess it pushes 'mask' to leaf if it is loop carried. > > > > > This makes CSE then unable to CSE an unmasked and a masked operation > > > leading to duplicate operations being performed. > > > > > > To counter this we want to add an RPO pass over the vectorized loop > > > body when vectorization succeeds. This makes it then no longer > > > reliant on the RTL level CSE. > > > > > > I have not added a testcase for this as it requires the changes in my > > > patch series, however the entire series relies on this patch to work > > > so all the tests there cover it. > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and > > > no issues. > > > > > > Ok for master? > > > > You are running VN over _all_ loop bodies rather only those vectorized. > > We loop over vectorized loops earlier for optimizing masked store sequences. > > I suppose you could hook in there. I'll also notice that we have > > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE. > > So I don't understand why it doesn't work to CSE later. > > > > Atm, say you have the conditions a > b, and a > b & a > c > > We generate > > mask1 = (a > b) & loop_mask > mask2 = (a > b & a > c) & loop_mask > > with the intention that mask1 can be re-used in mask2. > > Reassoc changes this to mask2 = a > b & (a > c & loop_mask) > > Which has now unmasked (a > b) in mask2, which leaves us unable to combine > the mask1 and mask2. It doesn't generate incorrect code, just inefficient. > > > for (i = 1; i < number_of_loops (cfun); i++) > > { > > loop_vec_info loop_vinfo; > > bool has_mask_store; > > > > loop = get_loop (cfun, i); > > if (!loop || !loop->aux) > > continue; > > loop_vinfo = (loop_vec_info) loop->aux; > > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); > > delete loop_vinfo; > > if (has_mask_store > > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) > > optimize_mask_stores (loop); > > loop->aux = NULL; > > } > > > > Ah thanks, I'll make the changes. Note I think that full-blown CSE is a bit overkill just to counter a deficient reassoc (or VN). At least it is supposed to be "cheap" and can be conditionalized on loop masks being used as well. > Thanks, > Tamar > > > > > > Thanks, > > > Tamar > > > > > > gcc/ChangeLog: > > > > > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN > > upon > > > successful vectorization. > > > > > > --- inline copy of patch -- > > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index > > > > > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c > > 660 > > > af4b6d32dc51 100644 > > > --- a/gcc/tree-vectorizer.c > > > +++ b/gcc/tree-vectorizer.c > > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "gimple-pretty-print.h" > > > #include "opt-problem.h" > > > #include "internal-fn.h" > > > - > > > +#include "tree-ssa-sccvn.h" > > > > > > /* Loop or bb location, with hotness information. */ > > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@ > > > vectorize_loops (void) > > > ??? Also while we try hard to update loop-closed SSA form we fail > > > to properly do this in some corner-cases (see PR56286). */ > > > rewrite_into_loop_closed_ssa (NULL, > > > TODO_update_ssa_only_virtuals); > > > + > > > + for (i = 1; i < number_of_loops (cfun); i++) > > > + { > > > + loop = get_loop (cfun, i); > > > + if (!loop || !single_exit (loop)) > > > + continue; > > > + > > > + bitmap exit_bbs; > > > + /* Perform local CSE, this esp. helps because we emit code for > > > + predicates that need to be shared for optimal predicate usage. > > > + However reassoc will re-order them and prevent CSE from working > > > + as it should. CSE only the loop body, not the entry. */ > > > + exit_bbs = BITMAP_ALLOC (NULL); > > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); > > > + bitmap_set_bit (exit_bbs, loop->latch->index); > > > + > > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); > > > + > > > + BITMAP_FREE (exit_bbs); > > > + } > > > + > > > return TODO_cleanup_cfg; > > > } > > > > > > > > > > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)