> On 27 Apr 2025, at 08:42, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, April 25, 2025 4:45 PM
>> To: Jennifer Schmitz <jschm...@nvidia.com>
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for
>> 128-bit VLS.
>>
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> For the test case
>>> int32_t foo (svint32_t x)
>>> {
>>>   svbool_t pg = svpfalse ();
>>>   return svlastb_s32 (pg, x);
>>> }
>>> compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
>>> foo:
>>>         pfalse  p3.b
>>>         lastb   w0, p3, z0.s
>>>         ret
>>> when it could use a Neon lane extract instead:
>>> foo:
>>>         umov    w0, v0.s[3]
>>>         ret
>>>
>>> We implemented this optimization by guarding the emission of
>>> pfalse+lastb in the pattern vec_extract<mode><Vel> by
>>> known_gt (BYTES_PER_SVE_VECTOR, 16).  Thus, for a last-extract operation
>>> in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead.
>>>
>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>>> OK for mainline?
>>>
>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>
>>> gcc/
>>>     * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
>>>     Prevent the emission of pfalse+lastb for 128-bit VLS.
>>>
>>> gcc/testsuite/
>>>     * gcc.target/aarch64/sve/extract_last_128.c: New test.
>>
>> OK, thanks.
>>
>
> Hi Both,
>
> If I may, how about viewing this transformation differently.
>
> How about instead make pfalse + lastb become rev + umov, which is
> faster on all Arm micro-architectures for both VLS and VLA, and then
> the optimization becomes lowering rev + umov for VL128 into umov[3]?
>
> This would speed up this case for everything.
>
> It would also be good to have a test that checks that
>
> #include <arm_sve.h>
>
> int foo (svint32_t a)
> {
>    return svrev (a)[0];
> }
>
> works.  At the moment, with -msve-vector-bits=128 GCC transforms
> this back into lastb, which is a de-optimization.  Your current patch does fix it
> so would be good to test this too.

Hi Richard and Tamar,

Thanks for the feedback.  Tamar's suggestion sounds like a more general
solution.  I think we could either change the define_expand
"vec_extract<mode><Vel>" to emit rev+umov instead of pfalse+lastb when the
last element is extracted, or add a define_insn_and_split that changes
pfalse+lastb into rev+umov (and into just umov for VL128).
What do you think?
Thanks,
Jennifer

>
> Thanks,
> Tamar
>
>> Richard
>>
>>> ---
>>>  gcc/config/aarch64/aarch64-sve.md             |  7 ++--
>>>  .../gcc.target/aarch64/sve/extract_last_128.c | 33 +++++++++++++++++++
>>>  2 files changed, 37 insertions(+), 3 deletions(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>>> index 3dbd65986ec..824bd877e47 100644
>>> --- a/gcc/config/aarch64/aarch64-sve.md
>>> +++ b/gcc/config/aarch64/aarch64-sve.md
>>> @@ -2969,10 +2969,11 @@
>>>   {
>>>     poly_int64 val;
>>>     if (poly_int_rtx_p (operands[2], &val)
>>> -       && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1))
>>> +       && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)
>>> +       && known_gt (BYTES_PER_SVE_VECTOR, 16))
>>>       {
>>> -       /* The last element can be extracted with a LASTB and a false
>>> -          predicate.  */
>>> +       /* Extract the last element with a LASTB and a false predicate.
>>> +          Exclude 128-bit VLS to use *vec_extract<mode><Vel>_v128.  */
>>>       rtx sel = aarch64_pfalse_reg (<VPRED>mode);
>>>       emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
>>>       DONE;
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
>>> new file mode 100644
>>> index 00000000000..71d3561ec60
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
>>> @@ -0,0 +1,33 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O3 -msve-vector-bits=128" } */
>>> +
>>> +#include <arm_sve.h>
>>> +
>>> +#define TEST(TYPE, TY)                 \
>>> +  TYPE exract_last_##TY (sv##TYPE x)   \
>>> +  {                                    \
>>> +    svbool_t pg = svpfalse ();         \
>>> +    return svlastb_##TY (pg, x);       \
>>> +  }
>>> +
>>> +TEST(bfloat16_t, bf16)
>>> +TEST(float16_t, f16)
>>> +TEST(float32_t, f32)
>>> +TEST(float64_t, f64)
>>> +TEST(int8_t, s8)
>>> +TEST(int16_t, s16)
>>> +TEST(int32_t, s32)
>>> +TEST(int64_t, s64)
>>> +TEST(uint8_t, u8)
>>> +TEST(uint16_t, u16)
>>> +TEST(uint32_t, u32)
>>> +TEST(uint64_t, u64)
>>> +
>>> +/* { dg-final { scan-assembler-times {\tdup\th0, v0\.h\[7\]} 2 } } */
>>> +/* { dg-final { scan-assembler-times {\tdup\ts0, v0\.s\[3\]} 1 } } */
>>> +/* { dg-final { scan-assembler-times {\tdup\td0, v0\.d\[1\]} 1 } } */
>>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.h\[7\]} 2 } } */
>>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.b\[15\]} 2 } } */
>>> +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.s\[3\]} 2 } } */
>>> +/* { dg-final { scan-assembler-times {\tumov\tx0, v0\.d\[1\]} 2 } } */
>>> +/* { dg-final { scan-assembler-not "lastb" } } */
>>> \ No newline at end of file
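
For reference, a minimal sketch of the additional test Tamar suggests (checking
that svrev (a)[0] compiles to a single Neon lane extract for 128-bit VLS) could
look like the following; the file name and the exact scan-assembler patterns are
illustrative assumptions, not part of the posted patch:

/* { dg-do compile } */
/* { dg-options "-O3 -msve-vector-bits=128" } */

#include <arm_sve.h>

/* Element 0 of the reversed vector is the last element of the original
   vector; for 128-bit VLS this should become a single Neon lane extract
   rather than pfalse+lastb.  */
int
foo (svint32_t a)
{
  return svrev (a)[0];
}

/* { dg-final { scan-assembler {\tumov\tw0, v0\.s\[3\]} } } */
/* { dg-final { scan-assembler-not "lastb" } } */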