Hi Richard,
The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <[email protected]> writes:
> > Hi Richard,
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <[email protected]> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> >    if (VECTOR_MODE_P (mode))
> >> >      mode = GET_MODE_INNER (mode);
> >> > 
> >> > +  /* The by element versions of the instruction has the same costs as the
> >> > +     normal 3 vector version. So don't add the costs of the duplicate into
> >> > +     the costs of the multiply. */
> >> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > +    op0 = XEXP (op0, 0);
> >> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > +    op1 = XEXP (op1, 0);
> >> > +
> >> >    /* Integer multiply/fma.  */
> >> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >> >      {
> >>
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >>
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >>
> >> But I guess there's no perfect answer there. The new code will be
> >> right for integer modes in some cases and not in others. Same if
> >> we leave things as they are. But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >>
> >
> > Hmm I haven't added the comment yet since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
>
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then. So before RA:
>
> > For instance for the integer case we used to generate
> >
> > dup v0.4s, w2
> > mul v2.4s, v2.4s, v0.4s
>
> ...this was costed as:
>
> (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in a GPR.
>
> > but now do
> >
> > fmov s0, w2
> > mul v2.4s, v2.4s, v0.s[0]
>
> ...and this is costed as:
>
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
>
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
>
Aargggs... yes that makes sense. Sorry, when I looked at the dump before I
didn't notice the order was switched.
The SET was for the load of course. :(
I have added the comment as suggested, thanks for the explanation.
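
Just to check my understanding, the pre-RA RTL that the cost hook now sees for
the by-element form is roughly this (only a sketch, register numbers made up):

  (set (reg:V4SI 101)
       (mult:V4SI (reg:V4SI 100)
                  (vec_duplicate:V4SI (reg:SI 99))))

i.e. the VEC_DUPLICATE is still folded into the MULT, so after stripping it we
only cost the multiply and assume reg 99 is already in an FPR; any FMOV from a
GPR only shows up after RA.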
OK for master?
Thanks,
Tamar
> Thanks,
> Richard
>
> > Which is better on older cores such as Cortex-A55 and no different on newer
> > cores such as Cortex-A76 according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard
--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+        {
+          /* The by-element versions of the instruction have the same costs as
+             the normal 3-vector version.  So don't add the costs of the
+             duplicate into the costs of the multiply.  We make an assumption
+             that the value in the VEC_DUPLICATE is already on the FP & SIMD
+             side, so costing a MUL by element pre RA is a bit optimistic.  */
+          if (GET_CODE (op0) == VEC_DUPLICATE)
+            op0 = XEXP (op0, 0);
+          else if (GET_CODE (op1) == VEC_DUPLICATE)
+            op1 = XEXP (op1, 0);
+        }
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */