On Tue, 2 Nov 2021, Tamar Christina wrote:
> Hi All,
>
> Following my current SVE predicate optimization series a problem has presented
> itself in that the way vector masks are generated for masked operations relies
> on CSE to share masks efficiently.
>
> The issue however is that masking is done using the & operand and & is
> associative and so reassoc decides to reassociate the masked operations.
But it does this for the purpose of canonicalization and thus CSE.
> This makes CSE then unable to CSE an unmasked and a masked operation leading
> to
> duplicate operations being performed.
>
> To counter this we want to add an RPO pass over the vectorized loop body when
> vectorization succeeds. This makes it then no longer reliant on the RTL level
> CSE.
>
> I have not added a testcase for this as it requires the changes in my patch
> series, however the entire series relies on this patch to work so all the
> tests there cover it.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-linux-gnu and no issues.
>
> Ok for master?
You are running VN over _all_ loop bodies rather only those vectorized.
We loop over vectorized loops earlier for optimizing masked store
sequences. I suppose you could hook in there. I'll also notice
that we have pass_pre_slp_scalar_cleanup which eventually runs
plus we have a late FRE. So I don't understand why it doesn't
work to CSE later.
for (i = 1; i < number_of_loops (cfun); i++)
{
loop_vec_info loop_vinfo;
bool has_mask_store;
loop = get_loop (cfun, i);
if (!loop || !loop->aux)
continue;
loop_vinfo = (loop_vec_info) loop->aux;
has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
delete loop_vinfo;
if (has_mask_store
&& targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
optimize_mask_stores (loop);
loop->aux = NULL;
}
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> successful vectorization.
>
> --- inline copy of patch --
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index
> 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51
> 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see
> #include "gimple-pretty-print.h"
> #include "opt-problem.h"
> #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>
> /* Loop or bb location, with hotness information. */
> dump_user_location_t vect_location;
> @@ -1323,6 +1323,27 @@ vectorize_loops (void)
> ??? Also while we try hard to update loop-closed SSA form we fail
> to properly do this in some corner-cases (see PR56286). */
> rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> +
> + for (i = 1; i < number_of_loops (cfun); i++)
> + {
> + loop = get_loop (cfun, i);
> + if (!loop || !single_exit (loop))
> + continue;
> +
> + bitmap exit_bbs;
> + /* Perform local CSE, this esp. helps because we emit code for
> + predicates that need to be shared for optimal predicate usage.
> + However reassoc will re-order them and prevent CSE from working
> + as it should. CSE only the loop body, not the entry. */
> + exit_bbs = BITMAP_ALLOC (NULL);
> + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> + bitmap_set_bit (exit_bbs, loop->latch->index);
> +
> + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> + BITMAP_FREE (exit_bbs);
> + }
> +
> return TODO_cleanup_cfg;
> }
>
>
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)