On 23/10/2024 12:20, Richard Sandiford wrote:
Thanks for doing this.  The approach looks good.  My main question is:
are we sure that we want to use the Advanced SIMD instructions for
signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
we only saturate at the negative limit if both operands are negative,
and only saturate at the positive limit if both operands are positive.
So for 32-bit values we can use:

        asr     tmp, x or y, #31
        eor     tmp, tmp, #0x7fffffff

to calculate the saturation value and:

        adds    res, x, y
        csel    res, tmp, res, vs

to calculate the full result.  That's the same number of instructions
as two fmovs for the inputs, the sqadd, and the fmov for the result,
but it should be more efficient.

The reason for asking now, rather than treating it as a potential
future improvement, is that it would also avoid splitting the patterns
for signed and unsigned ops.  (The length of the split alternative can be
conservatively set to 16 even for the unsigned version, since nothing
should care in practice.  The split will have happened before
shorten_branches.)

Hi Richard, thanks for looking over this.

I might be misunderstanding your suggestion, but is there a way to
efficiently check the sign of the second operand (let's say 'y') if it
is stored in a register? This is a problem we considered and couldn't
solve post-reload, as we only have three registers to work with (the
destination plus the two operands), though I may be wrong about how
many registers are available. AFAIK that's why we only use adds, csinv
/ subs, csel in the unsigned case.
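
The unsigned sequences only ever need the one fixed saturation value,
so they map onto something like this rough C model (hypothetical
helpers, for illustration only):

  #include <stdint.h>

  /* adds res, x, y; csinv res, res, xzr, cc -- all-ones if carry set.  */
  uint32_t
  uqadd32 (uint32_t x, uint32_t y)
  {
    uint32_t res = x + y;
    return res < x ? UINT32_MAX : res;
  }

  /* subs res, x, y; csel res, res, wzr, cs -- zero if there was a borrow.  */
  uint32_t
  uqsub32 (uint32_t x, uint32_t y)
  {
    return x > y ? x - y : 0;
  }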

To illustrate the point better: consider signed X + Y where both
operands are in GPRs. Without knowing the sign of Y, branchless code
would need to saturate at both the positive and negative limits and
then perform a comparison on Y to check its sign, selecting the
appropriate limit accordingly. This of course doesn't apply if a signed
saturating 'addition' with a negative op2 is only required to saturate
to the positive limit, nor does it apply if Y (op2) is an immediate.
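
Concretely, the shape we ended up considering for the general
register-register case was along these lines (rough C, hypothetical
helper; note the extra compare-and-select on y):

  #include <stdint.h>

  int32_t
  ssadd32_general (int32_t x, int32_t y)
  {
    /* Extra step: select the saturation limit from the sign of y.  */
    int32_t limit = y < 0 ? INT32_MIN : INT32_MAX;

    int32_t res;
    if (__builtin_add_overflow (x, y, &res))
      res = limit;
    return res;
  }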

Otherwise, I agree that this should be fixed now rather than as a future
improvement.


gcc/ChangeLog:

        * config/aarch64/aarch64-builtins.cc: Expand iterators.
        * config/aarch64/aarch64-simd-builtins.def: Use standard names.
        * config/aarch64/aarch64-simd.md: Use standard names, split insn
        definitions on signedness of operator and type of operands.
        * config/aarch64/arm_neon.h: Use standard builtin names.
        * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
        simplify splitting of insn for unsigned scalar arithmetic.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
        Template file for unsigned vector saturating arithmetic tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
        8-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
        16-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
        32-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
        64-bit vector type tests.
        * gcc.target/aarch64/saturating_arithmetic.inc: Template file
        for scalar saturating arithmetic tests.
        * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
new file mode 100644
index 00000000000..63eb21e438b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
@@ -0,0 +1,79 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
+/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
+
+/*
+** uadd_lane: { xfail *-*-* }
Just curious: why does this fail?  Is it a vector costing issue?
This is due to a missing pattern in match.pd; I've sent another patch
upstream to rectify it. In essence, this function exposes a commutative
form of an existing addition pattern, but that pattern isn't currently
treated as commutative when it should be. The uqsubs are marked as
xfail for a similar reason, and that same patch series contains a fix
for the uqsub case too.
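
For the curious, the idiom involved is roughly the following (my
reconstruction for illustration, not the exact testcase):

  #include <stdint.h>

  /* Recognised today: overflow test against the first operand.  */
  uint8_t
  uadd_sat (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    return sum < x ? UINT8_MAX : sum;
  }

  /* The commuted test against the second operand should be
     equivalent, but wasn't being matched.  */
  uint8_t
  uadd_sat_commuted (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    return sum < y ? UINT8_MAX : sum;
  }
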
Since the operands are commutative, and since there's no restriction
on the choice of destination register, it's probably safer to use:

+**     uqadd\tv[0-9].16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)
Similarly for the other qadds.  The qsubs do of course have a fixed
order, but the destination is similarly not restricted, so should use
[0-9]+ rather than \n.
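
E.g. a qsub line along those lines might look like (sketch only):

+**     uqsub\tv[0-9]+.16b, v\1.16b, v\2.16b
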
Ah yes, thanks for spotting that. This was a remnant from when I
imposed the constraint that operands 0 and 1 had to be in the same
register, but I later removed that constraint as I couldn't understand
why I'd put it there to begin with :)

Thanks once again for the feedback!
Akram
