> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, April 25, 2025 4:45 PM
> To: Jennifer Schmitz <jschm...@nvidia.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for 128-bit VLS.
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
> > For the test case
> > int32_t foo (svint32_t x)
> > {
> >  svbool_t pg = svpfalse ();
> >  return svlastb_s32 (pg, x);
> > }
> > compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
> > foo:
> >     pfalse  p3.b
> >     lastb   w0, p3, z0.s
> >     ret
> > when it could use a Neon lane extract instead:
> > foo:
> >     umov    w0, v0.s[3]
> >     ret
> >
> > We implemented this optimization by guarding the emission of
> > pfalse+lastb in the pattern vec_extract<mode><Vel> by
> > known_gt (BYTES_PER_SVE_VECTOR, 16). Thus, for a last-extract operation
> > in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead.
> >
> > The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
> > OK for mainline?
> >
> > Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >
> > gcc/
> >     * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
> >     Prevent the emission of pfalse+lastb for 128-bit VLS.
> >
> > gcc/testsuite/
> >     * gcc.target/aarch64/sve/extract_last_128.c: New test.
> 
> OK, thanks.
> 

Hi Both,

If I may, how about viewing this transformation differently?

How about instead making pfalse + lastb become rev + umov, which is
faster on all Arm micro-architectures for both VLS and VLA? Then the
optimization becomes lowering rev + umov for VL128 into a single
umov from lane 3.

This would speed up this case for everything.

It would also be good to have a test that checks that

#include <arm_sve.h>

int foo (svint32_t a)
{
    return svrev (a)[0];
}

works. At the moment, with -msve-vector-bits=128, GCC transforms
this back into lastb, which is a de-optimization. Your current patch
does fix it, so it would be good to test this too.

Thanks,
Tamar

> Richard
> 
> > ---
> > gcc/config/aarch64/aarch64-sve.md             |  7 ++--
> > .../gcc.target/aarch64/sve/extract_last_128.c | 33 +++++++++++++++++++
> > 2 files changed, 37 insertions(+), 3 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-
> sve.md
> > index 3dbd65986ec..824bd877e47 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -2969,10 +2969,11 @@
> >   {
> >     poly_int64 val;
> >     if (poly_int_rtx_p (operands[2], &val)
> > -   && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1))
> > +   && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)
> > +   && known_gt (BYTES_PER_SVE_VECTOR, 16))
> >       {
> > -   /* The last element can be extracted with a LASTB and a false
> > -      predicate.  */
> > +   /* Extract the last element with a LASTB and a false predicate.
> > +      Exclude 128-bit VLS to use *vec_extract<mode><Vel>_v128.  */
> >     rtx sel = aarch64_pfalse_reg (<VPRED>mode);
> >     emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
> >     DONE;
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> > new file mode 100644
> > index 00000000000..71d3561ec60
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -msve-vector-bits=128" } */
> > +
> > +#include <arm_sve.h>
> > +
> > +#define TEST(TYPE, TY)                     \
> > +  TYPE extract_last_##TY (sv##TYPE x)      \
> > +  {                                        \
> > +    svbool_t pg = svpfalse ();             \
> > +    return svlastb_##TY (pg, x);   \
> > +  }
> > +
> > +TEST(bfloat16_t, bf16)
> > +TEST(float16_t, f16)
> > +TEST(float32_t, f32)
> > +TEST(float64_t, f64)
> > +TEST(int8_t, s8)
> > +TEST(int16_t, s16)
> > +TEST(int32_t, s32)
> > +TEST(int64_t, s64)
> > +TEST(uint8_t, u8)
> > +TEST(uint16_t, u16)
> > +TEST(uint32_t, u32)
> > +TEST(uint64_t, u64)
> > +
> > +/* { dg-final { scan-assembler-times {\tdup\th0, v0\.h\[7\]} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tdup\ts0, v0\.s\[3\]} 1 } } */
> > +/* { dg-final { scan-assembler-times {\tdup\td0, v0\.d\[1\]} 1 } } */
> > +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.h\[7\]} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.b\[15\]} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.s\[3\]} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tumov\tx0, v0\.d\[1\]} 2 } } */
> > +/* { dg-final { scan-assembler-not "lastb" } } */
> > \ No newline at end of file
