On 23/10/2024 12:20, Richard Sandiford wrote:
Thanks for doing this.  The approach looks good.  My main question is:
are we sure that we want to use the Advanced SIMD instructions for
signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
we only saturate at the negative limit if both operands are negative,
and only saturate at the positive limit if both operands are positive.
So for 32-bit values we can use:

        asr     tmp, x or y, #31
        eor     tmp, tmp, #0x7fffffff

to calculate the saturation value and:

        adds    res, x, y
        csel    res, tmp, res, vs

to calculate the full result.  That's the same number of instructions
as two fmovs for the inputs, the sqadd, and the fmov for the result,
but it should be more efficient.

The reason for asking now, rather than treating it as a potential
future improvement, is that it would also avoid splitting the patterns
for signed and unsigned ops.  (The length of the split alternative can be
conservatively set to 16 even for the unsigned version, since nothing
should care in practice.  The split will have happened before
shorten_branches.)

Hi Richard, thanks for looking over this.

I might be misunderstanding your suggestion, but is there a way to
efficiently check the sign of the second operand (let's say 'y') if it
is stored in a register? This is a problem we considered and couldn't
solve post-reload, as we only have three registers to work with (the
destination plus the two operands), though I may be wrong about how
many registers are available. AFAIK that's why we only use adds, csinv
/ subs, csel in the unsigned case.
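
The unsigned sequences only ever need the one fixed saturation value,
so they map onto something like this rough C model (hypothetical
helpers, for illustration only):

  #include <stdint.h>

  /* adds res, x, y; csinv res, res, xzr, cc -- all-ones if carry set.  */
  uint32_t
  uqadd32 (uint32_t x, uint32_t y)
  {
    uint32_t res = x + y;
    return res < x ? UINT32_MAX : res;
  }

  /* subs res, x, y; csel res, res, wzr, cs -- zero if there was a borrow.  */
  uint32_t
  uqsub32 (uint32_t x, uint32_t y)
  {
    return x > y ? x - y : 0;
  }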

To illustrate the point better: consider signed X + Y where both
operands are in GPRs. Without knowing the sign of Y, branchless code
would need to saturate at both the positive and negative limits and
then perform a comparison on Y to check its sign, selecting the
appropriate limit accordingly. This of course doesn't apply if a signed
saturating 'addition' with a negative op2 is only required to saturate
to the positive limit, nor does it apply if Y (op2) is an immediate.
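
Concretely, the shape we ended up considering for the general
register-register case was along these lines (rough C, hypothetical
helper; note the extra compare-and-select on y):

  #include <stdint.h>

  int32_t
  ssadd32_general (int32_t x, int32_t y)
  {
    /* Extra step: select the saturation limit from the sign of y.  */
    int32_t limit = y < 0 ? INT32_MIN : INT32_MAX;

    int32_t res;
    if (__builtin_add_overflow (x, y, &res))
      res = limit;
    return res;
  }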

Otherwise, I agree that this should be fixed now rather than as a future
improvement.


gcc/ChangeLog:

        * config/aarch64/aarch64-builtins.cc: Expand iterators.
        * config/aarch64/aarch64-simd-builtins.def: Use standard names.
        * config/aarch64/aarch64-simd.md: Use standard names, split insn
        definitions on signedness of operator and type of operands.
        * config/aarch64/arm_neon.h: Use standard builtin names.
        * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
        simplify splitting of insn for unsigned scalar arithmetic.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
        Template file for unsigned vector saturating arithmetic tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
        8-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
        16-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
        32-bit vector type tests.
        * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
        64-bit vector type tests.
        * gcc.target/aarch64/saturating_arithmetic.inc: Template file
        for scalar saturating arithmetic tests.
        * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
        * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
new file mode 100644
index 00000000000..63eb21e438b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
@@ -0,0 +1,79 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
+/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
+
+/*
+** uadd_lane: { xfail *-*-* }
Just curious: why does this fail?  Is it a vector costing issue?
This is due to a missing pattern in match.pd; I've sent another patch
upstream to rectify it. In essence, this function exposes a commutative
form of an existing addition pattern, but that pattern isn't currently
treated as commutative when it should be. The uqsubs are marked as
xfail for a similar reason, and that same patch series contains a fix
for the uqsub case too.
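
For the curious, the idiom involved is roughly the following (my
reconstruction for illustration, not the exact testcase):

  #include <stdint.h>

  /* Recognised today: overflow test against the first operand.  */
  uint8_t
  uadd_sat (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    return sum < x ? UINT8_MAX : sum;
  }

  /* The commuted test against the second operand should be
     equivalent, but wasn't being matched.  */
  uint8_t
  uadd_sat_commuted (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;
    return sum < y ? UINT8_MAX : sum;
  }
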
Since the operands are commutative, and since there's no restriction
on the choice of destination register, it's probably safer to use:

+**     uqadd\tv[0-9].16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)
Similarly for the other qadds.  The qsubs do of course have a fixed
order, but the destination is similarly not restricted, so should use
[0-9]+ rather than \n.
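
E.g. a qsub line along those lines might look like (sketch only):

+**     uqsub\tv[0-9]+.16b, v\1.16b, v\2.16b
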
Ah yes, thanks for spotting that. This was a remnant from when I
imposed the constraint that operands 0 and 1 had to be in the same
register, but I later removed that constraint as I couldn't understand
why I'd put it there to begin with :)

Thanks once again for the feedback!
Akram
