On Tue, 2 Nov 2021, Tamar Christina wrote:

> Hi All,
> 
> Following my current SVE predicate optimization series a problem has presented
> itself in that the way vector masks are generated for masked operations relies
> on CSE to share masks efficiently.
> 
> The issue however is that masking is done using the & operand and & is
> associative and so reassoc decides to reassociate the masked operations.

But it does this for the purpose of canonicalization and thus CSE.

> This makes CSE then unable to CSE an unmasked and a masked operation leading 
> to
> duplicate operations being performed.
> 
> To counter this we want to add an RPO pass over the vectorized loop body when
> vectorization succeeds.  This makes it then no longer reliant on the RTL level
> CSE.
> 
> I have not added a testcase for this as it requires the changes in my patch
> series, however the entire series relies on this patch to work so all the
> tests there cover it.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-linux-gnu and no issues.
> 
> Ok for master?

You are running VN over _all_ loop bodies rather only those vectorized.
We loop over vectorized loops earlier for optimizing masked store
sequences.  I suppose you could hook in there.  I'll also notice
that we have pass_pre_slp_scalar_cleanup which eventually runs
plus we have a late FRE.  So I don't understand why it doesn't
work to CSE later.

  for (i = 1; i < number_of_loops (cfun); i++)
    {
      loop_vec_info loop_vinfo;
      bool has_mask_store;

      loop = get_loop (cfun, i);
      if (!loop || !loop->aux)
        continue;
      loop_vinfo = (loop_vec_info) loop->aux;
      has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
      delete loop_vinfo;
      if (has_mask_store
          && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
        optimize_mask_stores (loop);
      loop->aux = NULL;
    }


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
>       successful vectorization.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 
> 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51
>  100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>  
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
> @@ -1323,6 +1323,27 @@ vectorize_loops (void)
>        ???  Also while we try hard to update loop-closed SSA form we fail
>        to properly do this in some corner-cases (see PR56286).  */
>        rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> +
> +      for (i = 1; i < number_of_loops (cfun); i++)
> +     {
> +       loop = get_loop (cfun, i);
> +       if (!loop || !single_exit (loop))
> +         continue;
> +
> +       bitmap exit_bbs;
> +       /* Perform local CSE, this esp. helps because we emit code for
> +          predicates that need to be shared for optimal predicate usage.
> +          However reassoc will re-order them and prevent CSE from working
> +          as it should.  CSE only the loop body, not the entry.  */
> +       exit_bbs = BITMAP_ALLOC (NULL);
> +       bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +       bitmap_set_bit (exit_bbs, loop->latch->index);
> +
> +       do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> +       BITMAP_FREE (exit_bbs);
> +     }
> +
>        return TODO_cleanup_cfg;
>      }
>  
> 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)

Reply via email to