> -----Original Message-----
> From: Jennifer Schmitz <jschm...@nvidia.com>
> Sent: Monday, April 28, 2025 11:40 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford <richard.sandif...@arm.com>; Tamar Christina
> <tamar.christ...@arm.com>
> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for 128-bit VLS.
> 
> 
> 
> > On 27 Apr 2025, at 08:42, Tamar Christina <tamar.christ...@arm.com> wrote:
> >
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Friday, April 25, 2025 4:45 PM
> >> To: Jennifer Schmitz <jschm...@nvidia.com>
> >> Cc: gcc-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for 128-bit VLS.
> >>
> >> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>> For the test case
> >>> int32_t foo (svint32_t x)
> >>> {
> >>> svbool_t pg = svpfalse ();
> >>> return svlastb_s32 (pg, x);
> >>> }
> >>> compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
> >>> foo:
> >>>    pfalse  p3.b
> >>>    lastb   w0, p3, z0.s
> >>>    ret
> >>> when it could use a Neon lane extract instead:
> >>> foo:
> >>>    umov    w0, v0.s[3]
> >>>    ret
> >>>
> >>> We implemented this optimization by guarding the emission of
> >>> pfalse+lastb in the pattern vec_extract<mode><Vel> by
> >>> known_gt (BYTES_PER_SVE_VECTOR, 16). Thus, for a last-extract operation
> >>> in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead.
> >>>
> >>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
> >>> OK for mainline?
> >>>
> >>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
> >>>
> >>> gcc/
> >>>    * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
> >>>    Prevent the emission of pfalse+lastb for 128-bit VLS.
> >>>
> >>> gcc/testsuite/
> >>>    * gcc.target/aarch64/sve/extract_last_128.c: New test.
> >>
> >> OK, thanks.
> >>
> >
> > Hi Both,
> >
> > If I may, how about viewing this transformation differently.
> >
> > How about instead making pfalse + lastb become rev + umov, which is
> > faster on all Arm micro-architectures for both VLS and VLA? The
> > optimization then becomes lowering rev + umov for VL128 into a single
> > umov[3].
> >
> > This would speed up this case for everything.
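> >
> > For the example above, the generic (VLA, or VLS wider than 128) sequence
> > would then be something like (hand-written sketch, not from an actual build):
> >
> > foo:
> >    rev     z0.s, z0.s
> >    umov    w0, v0.s[0]
> >    ret
> >
> > since after the REV the last element sits in lane 0 of the low 128 bits,
> > and for VL128 the rev + umov[0] pair then folds into the single umov shown
> > above.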
> >
> > It would also be good to have a test that checks that
> >
> > #include <arm_sve.h>
> >
> > int foo (svint32_t a)
> > {
> >    return svrev (a)[0];
> > }
> >
> > works. At the moment, with -msve-vector-bits=128, GCC transforms this back
> > into lastb, which is a de-optimization. Your current patch does fix it, so
> > it would be good to test this too.
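> >
> > (With the patch and -msve-vector-bits=128, I'd expect the svrev example to
> > end up as a single lane extract, i.e. something like umov w0, v0.s[3],
> > rather than going through pfalse+lastb.)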
> Hi Richard and Tamar,
> thanks for the feedback.
> Tamar’s suggestion sounds like a more general solution.
> I think we could either change define_expand "vec_extract<mode><Vel>" to not
> emit pfalse+lastb when the last element is extracted, but emit rev+umov
> instead,

Hi Jennifer,

Last say is Richard's of course, but I'd have done it that way; it seems more
appropriate to just generate the REV directly, as it's always going to be what
we want.
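
Something along these lines in the vec_extract<mode><Vel> expander, perhaps
(untested sketch; the @aarch64_sve_rev pattern name and the fall-through to
the existing constant-index handling are assumptions on my part):

  poly_int64 val;
  if (poly_int_rtx_p (operands[2], &val)
      && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1))
    {
      /* Reverse the vector and extract element 0 instead of emitting
         pfalse+lastb.  Element 0 always lives in the low 128 bits, so
         the Advanced SIMD lane-extract handling below can finish the
         job.  */
      rtx rev = gen_reg_rtx (<MODE>mode);
      emit_insn (gen_aarch64_sve_rev (<MODE>mode, rev, operands[1]));
      operands[1] = rev;
      operands[2] = const0_rtx;
    }

The VL128 folding of rev + umov[0] into a single umov[3] could then be a
separate pattern on top of that.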

Thanks for doing it,
Tamar

> or we could add a define_insn_and_split to change pfalse+lastb to rev+umov
> (and just umov for VL128).
> What do you think?
> Thanks, Jennifer
> >
> > Thanks,
> > Tamar
> >
> >> Richard
> >>
> >>> ---
> >>> gcc/config/aarch64/aarch64-sve.md             |  7 ++--
> >>> .../gcc.target/aarch64/sve/extract_last_128.c | 33 +++++++++++++++++++
> >>> 2 files changed, 37 insertions(+), 3 deletions(-)
> >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> >>>
> >>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> >>> index 3dbd65986ec..824bd877e47 100644
> >>> --- a/gcc/config/aarch64/aarch64-sve.md
> >>> +++ b/gcc/config/aarch64/aarch64-sve.md
> >>> @@ -2969,10 +2969,11 @@
> >>>  {
> >>>    poly_int64 val;
> >>>    if (poly_int_rtx_p (operands[2], &val)
> >>> -   && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1))
> >>> +   && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)
> >>> +   && known_gt (BYTES_PER_SVE_VECTOR, 16))
> >>>      {
> >>> -   /* The last element can be extracted with a LASTB and a false
> >>> -      predicate.  */
> >>> +   /* Extract the last element with a LASTB and a false predicate.
> >>> +      Exclude 128-bit VLS to use *vec_extract<mode><Vel>_v128.  */
> >>>    rtx sel = aarch64_pfalse_reg (<VPRED>mode);
> >>>    emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
> >>>    DONE;
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> >>> new file mode 100644
> >>> index 00000000000..71d3561ec60
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
> >>> @@ -0,0 +1,33 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-options "-O3 -msve-vector-bits=128" } */
> >>> +
> >>> +#include <arm_sve.h>
> >>> +
> >>> +#define TEST(TYPE, TY)                     \
> >>> +  TYPE extract_last_##TY (sv##TYPE x)       \
> >>> +  {                                        \
> >>> +    svbool_t pg = svpfalse ();             \
> >>> +    return svlastb_##TY (pg, x);   \
> >>> +  }
> >>> +
> >>> +TEST(bfloat16_t, bf16)
> >>> +TEST(float16_t, f16)
> >>> +TEST(float32_t, f32)
> >>> +TEST(float64_t, f64)
> >>> +TEST(int8_t, s8)
> >>> +TEST(int16_t, s16)
> >>> +TEST(int32_t, s32)
> >>> +TEST(int64_t, s64)
> >>> +TEST(uint8_t, u8)
> >>> +TEST(uint16_t, u16)
> >>> +TEST(uint32_t, u32)
> >>> +TEST(uint64_t, u64)
> >>> +
> >>> +/* { dg-final { scan-assembler-times {\tdup\th0, v0\.h\[7\]} 2 } } */
> >>> +/* { dg-final { scan-assembler-times {\tdup\ts0, v0\.s\[3\]} 1 } } */
> >>> +/* { dg-final { scan-assembler-times {\tdup\td0, v0\.d\[1\]} 1 } } */
> >>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.h\[7\]} 2 } } */
> >>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.b\[15\]} 2 } } */
> >>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.s\[3\]} 2 } } */
> >>> +/* { dg-final { scan-assembler-times {\tumov\tx0, v0\.d\[1\]} 2 } } */
> >>> +/* { dg-final { scan-assembler-not "lastb" } } */
> >>> \ No newline at end of file
> 
