Richard Sandiford <richard.sandif...@arm.com> writes:
> Akram Ahmad <akram.ah...@arm.com> writes:
>> This renames the existing {s,u}q{add,sub} instructions to use the
>> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
>> IFN_SAT_SUB.
>>
>> The NEON intrinsics for saturating arithmetic and their corresponding
>> builtins are changed to use these standard names too.
>>
>> Using the standard names for the instructions causes 32 and 64-bit
>> unsigned scalar saturating arithmetic to use the NEON instructions,
>> resulting in an additional (and inefficient) FMOV to be generated when
>> the original operands are in GP registers. This patch therefore also
>> restores the original behaviour of using the adds/subs instructions
>> in this circumstance.
>>
>> Additional tests are written for the scalar and Adv. SIMD cases to
>> ensure that the correct instructions are used. The NEON intrinsics are
>> already tested elsewhere.
>
> Thanks for doing this. The approach looks good. My main question is:
> are we sure that we want to use the Advanced SIMD instructions for
> signed saturating SI and DI arithmetic on GPRs? E.g. for addition,
> we only saturate at the negative limit if both operands are negative,
> and only saturate at the positive limit if both operands are positive.
> So for 32-bit values we can use:
>
>	asr	tmp, x or y, #31
>	eor	tmp, tmp, #0x80000000
>
> to calculate the saturation value and:
>
>	adds	res, x, y
>	csel	res, tmp, res, vs
Bah, knew I should have sat on this before sending. tmp is the inverse
of the saturation value, so we want:

	csinv	res, res, tmp, vc

instead of the csel above.

> to calculate the full result. That's the same number of instructions
> as two fmovs for the inputs, the sqadd, and the fmov for the result,
> but it should be more efficient.
>
> The reason for asking now, rather than treating it as a potential
> future improvement, is that it would also avoid splitting the patterns
> for signed and unsigned ops. (The length of the split alternative can be
> conservatively set to 16 even for the unsigned version, since nothing
> should care in practice. The split will have happened before
> shorten_branches.)
>
>> gcc/ChangeLog:
>>
>>	* config/aarch64/aarch64-builtins.cc: Expand iterators.
>>	* config/aarch64/aarch64-simd-builtins.def: Use standard names.
>>	* config/aarch64/aarch64-simd.md: Use standard names, split insn
>>	definitions on signedness of operator and type of operands.
>>	* config/aarch64/arm_neon.h: Use standard builtin names.
>>	* config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
>>	simplify splitting of insn for unsigned scalar arithmetic.
>>
>> gcc/testsuite/ChangeLog:
>>
>>	* gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
>>	Template file for unsigned vector saturating arithmetic tests.
>>	* gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
>>	8-bit vector type tests.
>>	* gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
>>	16-bit vector type tests.
>>	* gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
>>	32-bit vector type tests.
>>	* gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
>>	64-bit vector type tests.
>>	* gcc.target/aarch64/saturating_arithmetic.inc: Template file
>>	for scalar saturating arithmetic tests.
>>	* gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
>>	* gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
>>	* gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
>>	* gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
>> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> new file mode 100644
>> index 00000000000..63eb21e438b
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> @@ -0,0 +1,79 @@
>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
>> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>> +
>> +/*
>> +** uadd_lane: { xfail *-*-* }
>
> Just curious: why does this fail? Is it a vector costing issue?
>
>> +**	dup\tv([0-9]+).8b, w0
>> +**	uqadd\tb([0-9]+), b\1, b0
>> +**	umov\tw0, v\2.b\[0]
>> +**	ret
>> +*/
>> +/*
>> +** uaddq:
>> +** ...
>> +**	ldr\tq([0-9]+), .*
>> +**	ldr\tq([0-9]+), .*
>> +**	uqadd\tv\2.16b, v\1.16b, v\2.16b
>
> Since the operands are commutative, and since there's no restriction
> on the choice of destination register, it's probably safer to use:
>
>> +**	uqadd\tv[0-9]+.16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)
>
> Similarly for the other qadds. The qsubs do of course have a fixed
> order, but the destination is similarly not restricted, so should use
> [0-9]+ rather than \n.
>
> Thanks,
> Richard
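
A minimal C sketch of the scalar logic behind the corrected asr/eor +
adds/csinv sequence (illustrative only, not part of the patch or of the
review above; __builtin_add_overflow stands in for the adds + V-flag test,
and the function name saddsat32 is made up for this example):

	#include <stdint.h>

	/* 32-bit signed saturating addition: asr/eor build the bitwise
	   inverse of the saturation limit, adds/csinv select it on
	   overflow.  */
	static inline int32_t
	saddsat32 (int32_t x, int32_t y)
	{
	  /* Bit-inverse of the saturation limit: 0x7fffffff if x < 0,
	     0x80000000 if x >= 0 (the asr + eor above).  */
	  uint32_t tmp = (uint32_t) (x >> 31) ^ 0x80000000u;
	  int32_t res;
	  /* Signed overflow corresponds to the V flag set by adds;
	     csinv then substitutes ~tmp for the wrapped sum.  */
	  if (__builtin_add_overflow (x, y, &res))
	    res = (int32_t) ~tmp;
	  return res;
	}

Overflow is only possible when x and y have the same sign, so the sign of
either operand is enough to pick the correct limit, as noted in the review.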