> On 28 Apr 2025, at 16:29, Richard Sandiford <richard.sandif...@arm.com> wrote: > > External email: Use caution opening links or attachments > > > Jennifer Schmitz <jschm...@nvidia.com> writes: >>> On 27 Apr 2025, at 08:42, Tamar Christina <tamar.christ...@arm.com> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>>> -----Original Message----- >>>> From: Richard Sandiford <richard.sandif...@arm.com> >>>> Sent: Friday, April 25, 2025 4:45 PM >>>> To: Jennifer Schmitz <jschm...@nvidia.com> >>>> Cc: gcc-patches@gcc.gnu.org >>>> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane >>>> extract for >>>> 128-bit VLS. >>>> >>>> Jennifer Schmitz <jschm...@nvidia.com> writes: >>>>> For the test case >>>>> int32_t foo (svint32_t x) >>>>> { >>>>> svbool_t pg = svpfalse (); >>>>> return svlastb_s32 (pg, x); >>>>> } >>>>> compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced: >>>>> foo: >>>>> pfalse p3.b >>>>> lastb w0, p3, z0.s >>>>> ret >>>>> when it could use a Neon lane extract instead: >>>>> foo: >>>>> umov w0, v0.s[3] >>>>> ret >>>>> >>>>> We implemented this optimization by guarding the emission of >>>>> pfalse+lastb in the pattern vec_extract<mode><Vel> by >>>>> known_gt (BYTES_PER_SVE_VECTOR, 16). Thus, for a last-extract operation >>>>> in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead. >>>>> >>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression. >>>>> OK for mainline? >>>>> >>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> >>>>> >>>>> gcc/ >>>>> * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>): >>>>> Prevent the emission of pfalse+lastb for 128-bit VLS. >>>>> >>>>> gcc/testsuite/ >>>>> * gcc.target/aarch64/sve/extract_last_128.c: New test. >>>> >>>> OK, thanks. >>>> >>> >>> Hi Both, >>> >>> If I may, how about viewing this transformation differently. >>> >>> How about instead make pfalse + lastb become rev + umov, which is >>> faster on all Arm micro-architectures for both VLS and VLA and then >>> the optimization becomes lowering rev + umov for VL128 into umov[3]? >>> >>> This would speed up this case for everything. >>> >>> It would also be good to have a test that checks that >>> >>> #include <arm_sve.h> >>> >>> int foo (svint32_t a) >>> { >>> return svrev (a)[0]; >>> } >>> >>> Works. At the moment, with -msve-vector-bits=128 GCC transforms >>> this back into lastb, which is a de-optimization. Your current patch does >>> fix it >>> so would be good to test this too. >> Hi Richard and Tamar, >> thanks for the feedback. >> Tamar’s suggestion sounds like a more general solution. >> I think we could either change define_expand "vec_extract<mode><Vel>” >> to not emit pfalse+lastb in case the last element is extracted, but emit >> rev+umov instead, >> or we could add a define_insn_and_split to change pfalse+lastb to rev+umov >> (and just umov for VL128). > > I think we should do it in the expand. But ISTM that there are > two separate things here: > > (1) We shouldn't use the VLA expansion for VLS cases if the VLS case > can do something simpler. > > (2) We should improve the expansion based on what uarches have actually > implemented. (The current code predates h/w.) > > I think we want (1) even if (2) involves using REV+UMOV, since for > (1) we still want a single UMOV for the top of the range. And we'd > still want the tests in the patch. > > So IMO, from a staging perspective, it makes sense to do (1) first > and treat (2) as a follow-on. > > The REV approach isn't restricted to the last element. It should be able > to handle: > > int8_t foo(svint8_t x) { return x[svcntb()-2]; } > > as well. In other words, if: > > auto idx = GET_MODE_NUNITS (mode) - val - 1; > > is a constant in the range [0, 255], we can do a reverse and then > fall through to the normal CONST_INT expansion with idx as the index. > > But I suppose that does mean that, rather than: > > && known_gt (BYTES_PER_SVE_VECTOR, 16)) > > we should instead check: > > && !val.is_constant () > > For 256-bit SVE we should use DUP instead of PLAST+LASTB or REV+UMOV. I adjusted the patch to use && !val.is_constant () as check, such that the extraction of the last element is matched by other patterns for VLS of all vector widths. 256-bit and 512-bit VLS are now matched by *vec_extract<mode><Vel>_dup and 1024-bit and 2048-bit by *vec_extract<mode><Vel>_ext. There were existing tests for 256-bit, 512-bit, 1024-bit and 2048-bit that could be adjusted. 128-bit VLS is covered by the new test in the patch. Let me know if I shall change it to be more analogous to the other extract_*.c tests.
About implementing point (2): Do we only want the REV approach you outlined for VLA, now that VLS is redirected to other patterns? Thanks, Jennifer For the test case int32_t foo (svint32_t x) { svbool_t pg = svpfalse (); return svlastb_s32 (pg, x); } compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced: foo: pfalse p3.b lastb w0, p3, z0.s ret when it could use a Neon lane extract instead: foo: umov w0, v0.s[3] ret Similar optimizations can be made for VLS with other vector widths. We implemented this optimization by guarding the emission of pfalse+lastb in the pattern vec_extract<mode><Vel> by !val.is_constant (). Thus, for last-extract operations with VLS, the patterns *vec_extract<mode><Vel>_v128, *vec_extract<mode><Vel>_dup, or *vec_extract<mode><Vel>_ext are used instead. We added tests for 128-bit VLS and adjusted the tests for the other vector widths. The patch was bootstrapped and tested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com> gcc/ * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>): Prevent the emission of pfalse+lastb for VLS. gcc/testsuite/ * gcc.target/aarch64/sve/extract_last_128.c: New test. * gcc.target/aarch64/sve/extract_1.c: Adjust expected outcome. * gcc.target/aarch64/sve/extract_2.c: Likewise. * gcc.target/aarch64/sve/extract_3.c: Likewise. * gcc.target/aarch64/sve/extract_4.c: Likewise. --- gcc/config/aarch64/aarch64-sve.md | 7 ++-- .../gcc.target/aarch64/sve/extract_1.c | 22 +++++-------- .../gcc.target/aarch64/sve/extract_2.c | 23 ++++++------- .../gcc.target/aarch64/sve/extract_3.c | 23 ++++++------- .../gcc.target/aarch64/sve/extract_4.c | 23 ++++++------- .../gcc.target/aarch64/sve/extract_last_128.c | 33 +++++++++++++++++++ 6 files changed, 76 insertions(+), 55 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index d4af3706294..7bf12ff25cc 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -2969,10 +2969,11 @@ { poly_int64 val; if (poly_int_rtx_p (operands[2], &val) - && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)) + && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1) + && !val.is_constant ()) { - /* The last element can be extracted with a LASTB and a false - predicate. */ + /* For VLA, extract the last element with a LASTB and a false + predicate. */ rtx sel = aarch64_pfalse_reg (<VPRED>mode); emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1])); DONE; diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c index 5d5edf26c19..92ae81cc4e3 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c @@ -56,40 +56,36 @@ typedef _Float16 vnx8hf __attribute__((vector_size (32))); TEST_ALL (EXTRACT) -/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 2 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 3 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[3\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 2 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 3 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[7\]\n} 2 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 3 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[15\]\n} 2 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 3 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c index 0e6ec836228..a3886b2de22 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c @@ -56,40 +56,37 @@ typedef _Float16 vnx16hf __attribute__((vector_size (64))); TEST_ALL (EXTRACT) -/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 2 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 3 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 2 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 3 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 3 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 3 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c index 0d7a2fa2527..c22b8a952de 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c @@ -77,18 +77,16 @@ typedef _Float16 vnx32hf __attribute__((vector_size (128))); TEST_ALL (EXTRACT) -/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 5 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 6 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 5 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 6 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */ @@ -96,11 +94,9 @@ TEST_ALL (EXTRACT) /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 5 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 6 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */ @@ -108,19 +104,20 @@ TEST_ALL (EXTRACT) /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 5 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 6 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #64\n} 7 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #72\n} 2 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #84\n} 2 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #94\n} 2 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #100\n} 1 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #120\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #124\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #126\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #127\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c index a706291023f..0fa91757dfd 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c @@ -84,18 +84,16 @@ typedef _Float16 v128hf __attribute__((vector_size (256))); TEST_ALL (EXTRACT) -/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 6 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 7 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */ /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 6 { target aarch64_little_endian } } } */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { target aarch64_big_endian } } } */ +/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 7 { target aarch64_little_endian } } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { target aarch64_big_endian } } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */ @@ -103,11 +101,9 @@ TEST_ALL (EXTRACT) /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 6 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 7 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */ @@ -115,16 +111,13 @@ TEST_ALL (EXTRACT) /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */ /* Also used to move the result of a non-Advanced SIMD extract. */ -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 6 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 7 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 } } */ /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #64\n} 7 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #72\n} 2 } } */ @@ -135,3 +128,7 @@ TEST_ALL (EXTRACT) /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #124\n} 2 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #126\n} 2 } } */ /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #127\n} 1 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #248\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #252\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #254\n} 2 } } */ +/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b, #255\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c new file mode 100644 index 00000000000..2684fb69756 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c @@ -0,0 +1,33 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -msve-vector-bits=128" } */ + +#include <arm_sve.h> + +#define TEST(TYPE, TY) \ + TYPE extract_last_##TY (sv##TYPE x) \ + { \ + svbool_t pg = svpfalse (); \ + return svlastb_##TY (pg, x); \ + } + +TEST(bfloat16_t, bf16) +TEST(float16_t, f16) +TEST(float32_t, f32) +TEST(float64_t, f64) +TEST(int8_t, s8) +TEST(int16_t, s16) +TEST(int32_t, s32) +TEST(int64_t, s64) +TEST(uint8_t, u8) +TEST(uint16_t, u16) +TEST(uint32_t, u32) +TEST(uint64_t, u64) + +/* { dg-final { scan-assembler-times {\tdup\th0, v0\.h\[7\]} 2 } } */ +/* { dg-final { scan-assembler-times {\tdup\ts0, v0\.s\[3\]} 1 } } */ +/* { dg-final { scan-assembler-times {\tdup\td0, v0\.d\[1\]} 1 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.h\[7\]} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.b\[15\]} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.s\[3\]} 2 } } */ +/* { dg-final { scan-assembler-times {\tumov\tx0, v0\.d\[1\]} 2 } } */ +/* { dg-final { scan-assembler-not "lastb" } } */ \ No newline at end of file -- 2.34.1 > > Thanks, > Richard
smime.p7s
Description: S/MIME cryptographic signature