Hi Richard and Marc,
Many thanks to you both for your feedback on my patch for PR 101895.
Here's version 2 of this patch, incorporating all of the suggested improvements.
The one minor complication is that the :s qualifier doesn't automatically
recognize that a capture already has two (or N) uses in a pattern,
so I have to manually confirm that there are no other uses of the mult
using num_imm_uses.
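
As an illustration only (this is not part of the patch), the reordering
rests on the observation that multiplying by a vector whose lanes are all
equal commutes with a single-input permutation.  The sketch below models
that in plain C on 4-element arrays; the helper names (permute,
mul_then_perm, perm_then_mul, orderings_agree) and the sample values are
hypothetical, and the lane order {0, 2, 1, 3} is borrowed from the new
test case.  The values are chosen to be exactly representable so the
comparison does not depend on whether the compiler contracts the
multiply and add.

```c
#include <assert.h>

#define N 4

/* Single-input permutation: out[i] = v[idx[i]], modelling
   VEC_PERM_EXPR <v, v, idx>.  */
static void permute(const float *v, const int *idx, float *out) {
  for (int i = 0; i < N; i++) {
    assert(idx[i] >= 0 && idx[i] < N);
    out[i] = v[idx[i]];
  }
}

/* Original shape: permute (c * splat (b)), then add a.  */
static void mul_then_perm(const float *c, float b, const int *idx,
                          const float *a, float *out) {
  float prod[N], shuffled[N];
  for (int i = 0; i < N; i++)
    prod[i] = c[i] * b;
  permute(prod, idx, shuffled);
  for (int i = 0; i < N; i++)
    out[i] = shuffled[i] + a[i];
}

/* Reordered shape: permute (c) * splat (b) + a, which leaves the
   multiply next to the add so it can become a fused multiply-add.  */
static void perm_then_mul(const float *c, float b, const int *idx,
                          const float *a, float *out) {
  float shuffled[N];
  permute(c, idx, shuffled);
  for (int i = 0; i < N; i++)
    out[i] = shuffled[i] * b + a[i];
}

/* Returns 1 if both orderings produce the same value in every lane.  */
int orderings_agree(void) {
  const float c[N] = {1.0f, 2.0f, 3.0f, 4.0f};
  const float a[N] = {10.0f, 20.0f, 30.0f, 40.0f};
  const int idx[N] = {0, 2, 1, 3};
  const float b = 0.5f;  /* the uniform (vec_same_elem_p) operand */
  float x[N], y[N];

  mul_then_perm(c, b, idx, a, x);
  perm_then_mul(c, b, idx, a, y);
  for (int i = 0; i < N; i++)
    if (x[i] != y[i])
      return 0;
  return 1;
}
```

Calling orderings_agree() from a small driver should return 1 for these
inputs.  The sketch says nothing about profitability: the num_imm_uses
check in the patch is there so that the original mult is not kept alive
by some other use.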

This revision has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?

2022-03-15  Roger Sayle  <ro...@nextmovesoftware.com>
            Marc Glisse  <marc.gli...@inria.fr>
            Richard Biener  <rguent...@suse.de>

gcc/ChangeLog
        PR tree-optimization/101895
        * match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
        (plus (vec_perm (mult ...) ...) ...): New reordering simplification.

gcc/testsuite/ChangeLog
        PR tree-optimization/101895
        * gcc.target/i386/pr101895.c: New test case.


Thanks in advance,
Roger
--

> -----Original Message-----
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: 14 March 2022 07:38
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Roger Sayle <ro...@nextmovesoftware.com>; Marc Glisse
> <marc.gli...@inria.fr>
> Subject: Re: [PATCH] PR tree-optimization/101895: Fold VEC_PERM to help
> recognize FMA.
> 
> On Sun, Mar 13, 2022 at 12:39 AM Marc Glisse via Gcc-patches <gcc-
> patc...@gcc.gnu.org> wrote:
> >
> > On Fri, 11 Mar 2022, Roger Sayle wrote:
> >
> > +(match vec_same_elem_p
> > +  CONSTRUCTOR@0
> > +  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
> > +                        ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0))
> > +                        : @0))))
> >
> > Ah, I didn't remember we needed that, we don't seem to be very
> > consistent about it. Probably for this reason, the transformation
> > "Prefer vector1 << scalar to vector1 << vector2" does not match
> >
> > typedef int vec __attribute__((vector_size(16)));
> > vec f(vec a, int b){
> >    vec bb = { b, b, b, b };
> >    return a << bb;
> > }
> >
> > which is only optimized at vector lowering time.
> 
> Few more comments - since match.pd is matching in match.pd order the
> 
> (match vec_same_elem_p
>   @0
>   (...))
> 
> should come last.  Please use
> 
> +(match vec_same_elem_p
> +  CONSTRUCTOR@0
>     (if (TREE_CODE (@0) == SSA_NAME
>          && uniform_vector_p (...
> 
> since otherwise we'll try uniform_vector_p twice on all CTORs (that are not
> uniform).
> 
> > +/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
> > +(for plusminus (plus minus)
> > +  (simplify
> > +    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
> > +    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))
> >
> > Don't you want :s on mult and vec_perm?
> 
> Yes.  Also for plus you want :c on it, likewise you want :c on the mult.
> The :c on the plus will require splitting the plus and minus case :/
> 
> Otherwise looks reasonable.
> 
> Richard.
> 
> >
> > --
> > Marc Glisse
diff --git a/gcc/match.pd b/gcc/match.pd
index 97399e5..12c92f4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7689,16 +7689,33 @@ and,
 /* VEC_PERM_EXPR (v, v, mask) -> v where v contains same element.  */
 
 (match vec_same_elem_p
+ (vec_duplicate @0))
+
+(match vec_same_elem_p
+ CONSTRUCTOR@0
+ (if (TREE_CODE (@0) == SSA_NAME
+      && uniform_vector_p (gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0))))))
+
+(match vec_same_elem_p
  @0
  (if (uniform_vector_p (@0))))
 
-(match vec_same_elem_p
- (vec_duplicate @0))
 
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
 
+/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
+(simplify
+ (plus:c (vec_perm:s (mult:c@0 @1 vec_same_elem_p@2) @0 @3) @4)
+ (if (TREE_CODE (@0) == SSA_NAME && num_imm_uses (@0) == 2)
+  (plus (mult (vec_perm @1 @1 @3) @2) @4)))
+(simplify
+ (minus (vec_perm:s (mult:c@0 @1 vec_same_elem_p@2) @0 @3) @4)
+ (if (TREE_CODE (@0) == SSA_NAME && num_imm_uses (@0) == 2)
+  (minus (mult (vec_perm @1 @1 @3) @2) @4)))
+
+
 /* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
    The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
    constant which when multiplied by a power of 2 contains a unique value
diff --git a/gcc/testsuite/gcc.target/i386/pr101895.c b/gcc/testsuite/gcc.target/i386/pr101895.c
new file mode 100644
index 0000000..4d0f1cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101895.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=cascadelake" } */
+
+void foo(float * __restrict__ a, float b, float *c) {
+  a[0] = c[0]*b + a[0];
+  a[1] = c[2]*b + a[1];
+  a[2] = c[1]*b + a[2];
+  a[3] = c[3]*b + a[3];
+}
+
+/* { dg-final { scan-assembler "vfmadd" } } */
