Jennifer Schmitz <jschm...@nvidia.com> writes:
>> On 27 Apr 2025, at 08:42, Tamar Christina <tamar.christ...@arm.com> wrote:
>>
>> External email: Use caution opening links or attachments
>>
>>
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>> Sent: Friday, April 25, 2025 4:45 PM
>>> To: Jennifer Schmitz <jschm...@nvidia.com>
>>> Cc: gcc-patches@gcc.gnu.org
>>> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane
>>> extract for 128-bit VLS.
>>>
>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>> For the test case
>>>> int32_t foo (svint32_t x)
>>>> {
>>>>   svbool_t pg = svpfalse ();
>>>>   return svlastb_s32 (pg, x);
>>>> }
>>>> compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
>>>> foo:
>>>>         pfalse  p3.b
>>>>         lastb   w0, p3, z0.s
>>>>         ret
>>>> when it could use a Neon lane extract instead:
>>>> foo:
>>>>         umov    w0, v0.s[3]
>>>>         ret
>>>>
>>>> We implemented this optimization by guarding the emission of
>>>> pfalse+lastb in the pattern vec_extract<mode><Vel> by
>>>> known_gt (BYTES_PER_SVE_VECTOR, 16). Thus, for a last-extract operation
>>>> in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead.
>>>>
>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>>>> OK for mainline?
>>>>
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>
>>>> gcc/
>>>> 	* config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
>>>> 	Prevent the emission of pfalse+lastb for 128-bit VLS.
>>>>
>>>> gcc/testsuite/
>>>> 	* gcc.target/aarch64/sve/extract_last_128.c: New test.
>>>
>>> OK, thanks.
>>>
>>
>> Hi Both,
>>
>> If I may, how about viewing this transformation differently.
>>
>> How about instead make pfalse + lastb become rev + umov, which is
>> faster on all Arm micro-architectures for both VLS and VLA and then
>> the optimization becomes lowering rev + umov for VL128 into umov[3]?
>>
>> This would speed up this case for everything.
>>
>> It would also be good to have a test that checks that
>>
>> #include <arm_sve.h>
>>
>> int foo (svint32_t a)
>> {
>>   return svrev (a)[0];
>> }
>>
>> Works. At the moment, with -msve-vector-bits=128 GCC transforms this
>> back into lastb, which is a de-optimization. Your current patch does
>> fix it so would be good to test this too.
> Hi Richard and Tamar,
> thanks for the feedback.
> Tamar's suggestion sounds like a more general solution.
> I think we could either change define_expand "vec_extract<mode><Vel>"
> to not emit pfalse+lastb in case the last element is extracted, but
> emit rev+umov instead, or we could add a define_insn_and_split to
> change pfalse+lastb to rev+umov (and just umov for VL128).
I think we should do it in the expand.  But ISTM that there are two
separate things here:

(1) We shouldn't use the VLA expansion for VLS cases if the VLS case
    can do something simpler.

(2) We should improve the expansion based on what uarches have
    actually implemented.  (The current code predates h/w.)

I think we want (1) even if (2) involves using REV+UMOV, since for (1)
we still want a single UMOV for the top of the range.  And we'd still
want the tests in the patch.  So IMO, from a staging perspective, it
makes sense to do (1) first and treat (2) as a follow-on.

The REV approach isn't restricted to the last element.  It should be
able to handle:

  int8_t foo(svint8_t x) { return x[svcntb()-2]; }

as well.  In other words, if:

  auto idx = GET_MODE_NUNITS (mode) - val - 1;

is a constant in the range [0, 255], we can do a reverse and then fall
through to the normal CONST_INT expansion with idx as the index.

But I suppose that does mean that, rather than:

  && known_gt (BYTES_PER_SVE_VECTOR, 16))

we should instead check:

  && !val.is_constant ()

For 256-bit SVE we should use DUP instead of PFALSE+LASTB or REV+UMOV.

Thanks,
Richard
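
To make that concrete, here is a minimal sketch of the source-level
cases the reverse-index check above would cover, assuming
-msve-vector-bits=128.  The function names are illustrative only and
are not part of the submitted extract_last_128.c test:

/* Sketch only: example functions, not the contents of the submitted
   gcc.target/aarch64/sve/extract_last_128.c test.  */

#include <arm_sve.h>
#include <stdint.h>

/* Last element: the reverse index GET_MODE_NUNITS - val - 1 is 0, so
   for 128-bit VLS this should become a single UMOV.  */
int32_t
extract_last (svint32_t x)
{
  return x[svcntw () - 1];
}

/* Second-to-last element: the reverse index is 1, still a small
   constant, so a REV followed by the normal constant-index extract
   would cover it.  */
int8_t
extract_second_to_last (svint8_t x)
{
  return x[svcntb () - 2];
}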