Hi Richard,
The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <[email protected]> writes:
> > Hi Richard,
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <[email protected]> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> >    if (VECTOR_MODE_P (mode))
> >> >      mode = GET_MODE_INNER (mode);
> >> > 
> >> > +  /* The by element versions of the instruction has the same costs as the
> >> > +     normal 3 vector version. So don't add the costs of the duplicate into
> >> > +     the costs of the multiply. */
> >> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > +    op0 = XEXP (op0, 0);
> >> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > +    op1 = XEXP (op1, 0);
> >> > +
> >> >    /* Integer multiply/fma.  */
> >> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >> >      {
> >>
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >>
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >>
> >> But I guess there's no perfect answer there. The new code will be
> >> right for integer modes in some cases and not in others. Same if
> >> we leave things as they are. But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >>
> >
> > Hmm I haven't added the comment yet since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
>
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then. So before RA:
>
> > For instance for the integer case we used to generate
> >
> > dup v0.4s, w2
> > mul v2.4s, v2.4s, v0.4s
>
> ...this was costed as:
>
> (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in a GPR.
>
> > but now do
> >
> > fmov s0, w2
> > mul v2.4s, v2.4s, v0.s[0]
>
> ...and this is costed as:
>
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
>
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
>
Aargggs... yes that makes sense. Sorry, when I looked at the dump before I
didn't notice the order was switched.
The SET was for the load of course. :(
I have added the comment as suggested, thanks for the explanation.
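
Just to check my understanding, the pre-RA RTL that the cost hook now sees for
the by-element form is roughly this (only a sketch, register numbers made up):

  (set (reg:V4SI 101)
       (mult:V4SI (reg:V4SI 100)
                  (vec_duplicate:V4SI (reg:SI 99))))

i.e. the VEC_DUPLICATE is still folded into the MULT, so after stripping it we
only cost the multiply and assume reg 99 is already in an FPR; any FMOV from a
GPR only shows up after RA.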
OK for master?
Thanks,
Tamar
> Thanks,
> Richard
>
> > Which is better on older cores such as Cortex-A55 and no different on newer
> > cores such as Cortex-A76 according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard
--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+        {
+          /* The by-element versions of the instruction have the same costs as
+             the normal 3-vector version.  So don't add the costs of the
+             duplicate into the costs of the multiply.  We make an assumption
+             that the value in the VEC_DUPLICATE is already on the FP & SIMD
+             side, so costing a MUL by element pre RA is a bit optimistic.  */
+          if (GET_CODE (op0) == VEC_DUPLICATE)
+            op0 = XEXP (op0, 0);
+          else if (GET_CODE (op1) == VEC_DUPLICATE)
+            op1 = XEXP (op1, 0);
+        }
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */