Richard Sandiford <richard.sandif...@arm.com> writes:
> Akram Ahmad <akram.ah...@arm.com> writes:
>> This renames the existing {s,u}q{add,sub} instructions to use the
>> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
>> IFN_SAT_SUB.
>>
>> The NEON intrinsics for saturating arithmetic and their corresponding
>> builtins are changed to use these standard names too.
>>
>> Using the standard names for the instructions causes 32- and 64-bit
>> unsigned scalar saturating arithmetic to use the NEON instructions,
>> resulting in an additional (and inefficient) FMOV being generated
>> when the original operands are in GP registers. This patch therefore
>> also restores the original behaviour of using the adds/subs
>> instructions in this circumstance.
>>
>> Additional tests are written for the scalar and Adv. SIMD cases to
>> ensure that the correct instructions are used. The NEON intrinsics are
>> already tested elsewhere.
>
> Thanks for doing this.  The approach looks good.  My main question is:
> are we sure that we want to use the Advanced SIMD instructions for
> signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
> we only saturate at the negative limit if both operands are negative,
> and only saturate at the positive limit if both operands are positive.
> So for 32-bit values we can use:
>
>       asr     tmp, x or y, #31
>       eor     tmp, tmp, #0x80000000
>
> to calculate the saturation value and:
>
>       adds    res, x, y
>       csel    res, tmp, res, vs

Bah, knew I should have sat on this before sending.  tmp is the
inverse of the saturation value, so we want:

        csinv   res, res, tmp, vc

instead of the csel above.
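
In C terms, the whole sequence is equivalent to something like the
sketch below (function name hypothetical; it assumes GCC's arithmetic
right shift on signed values, matching the asr above):

    #include <stdint.h>

    int32_t
    ssadd32_sketch (int32_t x, int32_t y)
    {
      /* asr tmp, x, #31; eor tmp, tmp, #0x80000000.
         tmp is the bitwise inverse of the saturation value:
         ~tmp == INT32_MAX for x >= 0, ~tmp == INT32_MIN for x < 0.  */
      uint32_t tmp = (uint32_t) (x >> 31) ^ 0x80000000u;
      int32_t res;
      /* adds res, x, y -- signed overflow sets V, and can only
         happen when x and y have the same sign.  */
      if (__builtin_add_overflow (x, y, &res))
        /* csinv res, res, tmp, vc -- on overflow take ~tmp.  */
        return (int32_t) ~tmp;
      return res;
    }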

> to calculate the full result.  That's the same number of instructions
> as two fmovs for the inputs, the sqadd, and the fmov for the result,
> but it should be more efficient.
>
> The reason for asking now, rather than treating it as a potential
> future improvement, is that it would also avoid splitting the patterns
> for signed and unsigned ops.  (The length of the split alternative can be
> conservatively set to 16 even for the unsigned version, since nothing
> should care in practice.  The split will have happened before
> shorten_branches.)
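
(Concretely, keeping a single pattern would mean one merged insn whose
length attribute lists the split (GPR) alternative at 16 bytes, e.g.
something like -- values illustrative, not taken from the patch:

    [(set_attr "length" "16,4")]

where 16 covers the four-instruction signed GPR sequence and is merely
conservative for the shorter unsigned one.)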
>
>> gcc/ChangeLog:
>>
>>      * config/aarch64/aarch64-builtins.cc: Expand iterators.
>>      * config/aarch64/aarch64-simd-builtins.def: Use standard names.
>>      * config/aarch64/aarch64-simd.md: Use standard names, split insn
>>      definitions on signedness of operator and type of operands.
>>      * config/aarch64/arm_neon.h: Use standard builtin names.
>>      * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
>>      simplify splitting of insn for unsigned scalar arithmetic.
>>
>> gcc/testsuite/ChangeLog:
>>
>>      * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
>>      Template file for unsigned vector saturating arithmetic tests.
>>      * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
>>      8-bit vector type tests.
>>      * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
>>      16-bit vector type tests.
>>      * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
>>      32-bit vector type tests.
>>      * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
>>      64-bit vector type tests.
>>      * gcc.target/aarch64/saturating_arithmetic.inc: Template file
>>      for scalar saturating arithmetic tests.
>>      * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
>>      * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
>>      * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
>>      * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
>> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> new file mode 100644
>> index 00000000000..63eb21e438b
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> @@ -0,0 +1,79 @@
>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
>> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>> +
>> +/*
>> +** uadd_lane: { xfail *-*-* }
>
> Just curious: why does this fail?  Is it a vector costing issue?
>
>> +**  dup\tv([0-9]+).8b, w0
>> +**  uqadd\tb([0-9]+), b\1, b0
>> +**  umov\tw0, v\2.b\[0]
>> +**  ret
>> +*/
>> +/*
>> +** uaddq:
>> +** ...
>> +**  ldr\tq([0-9]+), .*
>> +**  ldr\tq([0-9]+), .*
>> +**  uqadd\tv\2.16b, v\1.16b, v\2.16b
>
> Since the operands are commutative, and since there's no restriction
> on the choice of destination register, it's probably safer to use:
>
>> +**  uqadd\tv[0-9]+.16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)
>
> Similarly for the other qadds.  The qsubs do of course have a fixed
> operand order, but the destination is similarly unrestricted, so they
> should use [0-9]+ rather than a back-reference like \1 or \2.
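>
> For instance, adapting the uaddq body above (illustrative only):
>
>   uqsub\tv[0-9]+.16b, v\1.16b, v\2.16b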
>
> Thanks,
> Richard
