Richard Sandiford via Gcc-patches <[email protected]> writes:
> Richard Biener <[email protected]> writes:
>> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
>> <[email protected]> wrote:
>>>
>>> The issue-based vector costs currently assume that a multiply-add
>>> sequence can be implemented using a single instruction. This is
>>> generally true for scalars (which have a 4-operand instruction)
>>> and SVE (which allows the output to be tied to any input).
>>> However, for Advanced SIMD, multiplying two values and adding
>>> an invariant will end up being a move and an MLA.
>>>
>>> The only target to use the issue-based vector costs is Neoverse V1,
>>> which would generally prefer SVE in this case anyway. I therefore
>>> don't have a self-contained testcase. However, the distinction
>>> becomes more important with a later patch.
>>
>> But we do cost any invariants separately (for the prologue), so they
>> should be available in a register. How doesn't that work?
>
> Yeah, that works, and the prologue part is costed correctly. But the
> result of an Advanced SIMD FMLA is tied to the accumulator input, so if
> the accumulator input is an invariant, we need a register move (in the
> loop body) before the FMLA.
>
> E.g. for:
>
> void
> f (float *restrict x, float *restrict y, float *restrict z, float a)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] * z[i] + 1.0f;
> }
>
> the scalar code is:
>
> .L2:
>         ldr     s1, [x1, x3]
>         ldr     s2, [x2, x3]
>         fmadd   s1, s1, s2, s0
>         str     s1, [x0, x3]
>         add     x3, x3, 4
>         cmp     x3, 400
>         bne     .L2
>
> the SVE code is:
>
> .L2:
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>         fmad    z0.s, p1/m, z1.s, z2.s
>         st1w    z0.s, p0, [x0, x3, lsl 2]
>         add     x3, x3, x5
>         whilelo p0.s, w3, w4
>         b.any   .L2
>
> but the Advanced SIMD code is:
>
> .L2:
>         mov     v0.16b, v3.16b    // <-------- boo
>         ldr     q2, [x2, x3]
>         ldr     q1, [x1, x3]
>         fmla    v0.4s, v2.4s, v1.4s
>         str     q0, [x0, x3]
>         add     x3, x3, 16
>         cmp     x3, 400
>         bne     .L2
>
> Thanks,
> Richard
>
>
>>
>>> gcc/
>>> 	* config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>> 	parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>> 	certainly require a MOV.
>>> 	(aarch64_count_ops): Update accordingly.
>>> ---
>>> gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 084f8caa0da..19045ef6944 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>>>
>>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>>     or multiply-subtract sequence that might be suitable for fusing into a
>>> -   single instruction.  */
>>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>>> +   VEC_* flags.  */
>>>  static bool
>>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +			unsigned int vec_flags)
>>>  {
>>>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>>>    if (!assign)
>>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>> 	continue;
>>>
>>> +      if (vec_flags & VEC_ADVSIMD)
>>> +	{
>>> +	  /* Scalar and SVE code can tie the result to any FMLA input (or none,
>>> +	     although that requires a MOVPRFX for SVE).  However, Advanced SIMD
>>> +	     only supports MLA forms, so will require a move if the result
>>> +	     cannot be tied to the accumulator.  The most important case in
>>> +	     which this is true is when the accumulator input is invariant.  */
>>> +	  rhs = gimple_op (assign, 3 - i);
>>> +	  if (TREE_CODE (rhs) != SSA_NAME)
>>> +	    return false;
>>> +	  def_stmt_info = vinfo->lookup_def (rhs);
>>> +	  if (!def_stmt_info
>>> +	      || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>>> +	    return false;
>>> +	}
>>> +
>>>        return true;
>>>      }
>>>    return false;
>>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>> }
>>>
>>>    /* Assume that multiply-adds will become a single operation.  */
>>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>>      return;
>>>
>>>    /* When costing scalar statements in vector code, the count already