Soumya AR <soum...@nvidia.com> writes: > Hi Richard, > > Thanks for reviewing this! > > I’ve made the suggested changes and added the aarch64_ptrue_reg overload.
The updated patch is OK, thanks. > Thank you for providing the implementation for this, I have added you > as a co-author for the patch, hope that works :) Sure, that's ok with me FWIW. IMO it isn't really necessary though, so it's ok without too :) Sometimes it just feels easier to write review comments in the form of code snippets, since that's more direct & precise than describing something in English. It is still fundamentally a code review though. Richard > > Best, > Soumya > > >> On 5 Dec 2024, at 10:08 PM, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> External email: Use caution opening links or attachments >> >> >> Soumya AR <soum...@nvidia.com> writes: >>> The ASRD instruction on SVE performs an arithmetic shift right by an >>> immediate >>> for divide. >>> >>> This patch enables the use of ASRD with Neon modes. >>> >>> For example: >>> >>> int in[N], out[N]; >>> >>> void >>> foo (void) >>> { >>> for (int i = 0; i < N; i++) >>> out[i] = in[i] / 4; >>> } >>> >>> compiles to: >>> >>> ldr q31, [x1, x0] >>> cmlt v30.16b, v31.16b, #0 >>> and z30.b, z30.b, 3 >>> add v30.16b, v30.16b, v31.16b >>> sshr v30.16b, v30.16b, 2 >>> str q30, [x0, x2] >>> add x0, x0, 16 >>> cmp x0, 1024 >>> >>> but can just be: >>> >>> ldp q30, q31, [x0], 32 >>> asrd z31.b, p7/m, z31.b, #2 >>> asrd z30.b, p7/m, z30.b, #2 >>> stp q30, q31, [x1], 32 >>> cmp x0, x2 >>> >>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>> regression. >>> OK for mainline? >>> >>> Signed-off-by: Soumya AR <soum...@nvidia.com> >>> >>> gcc/ChangeLog: >>> >>> * config/aarch64/aarch64-sve.md: Extended sdiv_pow2<mode>3 and >>> *sdiv_pow2<mode>3 to support Neon modes. >>> >>> gcc/testsuite/ChangeLog: >>> >>> * gcc.target/aarch64/sve/sve-asrd.c: New test. >>> --- >>> gcc/config/aarch64/aarch64-sve.md | 25 ++++----- >>> .../gcc.target/aarch64/sve/sve-asrd.c | 54 +++++++++++++++++++ >>> 2 files changed, 67 insertions(+), 12 deletions(-) >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >> >> Rearranging to put the testcase first: >> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>> b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>> new file mode 100644 >>> index 00000000000..00aa8b2380d >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c >>> @@ -0,0 +1,54 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */ >>> +/* { dg-final { check-function-bodies "**" "" "" } } */ >>> + >>> +#include <stdint.h> >>> +#define N 1024 >>> + >>> +#define FUNC(M) \ >>> +M in_##M[N]; \ >>> +M out_##M[N]; \ >>> +void asrd_##M() { \ >>> + for (int i = 0; i < N; i++) \ >>> + out_##M[i] = in_##M[i] / 4; \ >>> +} >>> + >>> +/* >>> +** asrd_int8_t: >>> +** ... >>> +** ptrue (p[0-7]).b, vl1 >>> +** ... >>> +** asrd z[0-9]+\.b, \1/m, z[0-9]+\.b, #2 >>> +** ... >>> +*/ >>> +FUNC(int8_t) >> >> ptrue ..., vl1 will only do the shift/division on the first element. >> It should instead be vl8 for 64-bit modes and vl16 for 128-bit modes. >> >>> + >>> +/* >>> +** asrd_int16_t: >>> +** ... >>> +** ptrue (p[0-7]).b, vl2 >>> +** ... >>> +** asrd z[0-9]+\.h, \1/m, z[0-9]+\.h, #2 >>> +** ... >>> +*/ >>> +FUNC(int16_t) >>> + >>> +/* >>> +** asrd_int32_t: >>> +** ... >>> +** ptrue (p[0-7]).b, vl4 >>> +** ... >>> +** asrd z[0-9]+\.s, \1/m, z[0-9]+\.s, #2 >>> +** ... >>> +*/ >>> +FUNC(int32_t) >>> + >>> +/* >>> +** asrd_int64_t: >>> +** ... >>> +** ptrue (p[0-7]).b, vl8 >>> +** ... >>> +** asrd z[0-9]+\.d, \1/m, z[0-9]+\.d, #2 >>> +** ... >>> +*/ >>> +FUNC(int64_t) >>> diff --git a/gcc/config/aarch64/aarch64-sve.md >>> b/gcc/config/aarch64/aarch64-sve.md >>> index affdb24a93d..96effe4abed 100644 >>> --- a/gcc/config/aarch64/aarch64-sve.md >>> +++ b/gcc/config/aarch64/aarch64-sve.md >>> @@ -4972,34 +4972,35 @@ >>> >>> ;; Unpredicated ASRD. >>> (define_expand "sdiv_pow2<mode>3" >>> - [(set (match_operand:SVE_I 0 "register_operand") >>> - (unspec:SVE_I >>> + [(set (match_operand:SVE_VDQ_I 0 "register_operand") >>> + (unspec:SVE_VDQ_I >>> [(match_dup 3) >>> - (unspec:SVE_I >>> - [(match_operand:SVE_I 1 "register_operand") >>> + (unspec:SVE_VDQ_I >>> + [(match_operand:SVE_VDQ_I 1 "register_operand") >>> (match_operand 2 "aarch64_simd_rshift_imm")] >>> UNSPEC_ASRD)] >>> UNSPEC_PRED_X))] >>> "TARGET_SVE" >>> { >>> - operands[3] = aarch64_ptrue_reg (<VPRED>mode); >>> + operands[3] = aarch64_ptrue_reg (<VPRED>mode, >>> + GET_MODE_UNIT_SIZE (<MODE>mode)); >> >> Perhaps we should add yet another overload of aarch64_ptrue_reg that >> takes the data mode as a second argument. The implementation could >> be something like: >> >> /* Return a register of mode PRED_MODE for controlling data of mode >> DATA_MODE. >> >> DATA_MODE can be a scalar, an Advanced SIMD vector, or an SVE vector. >> If it's an N-byte scalar or an Advanced SIMD vector, the first N bits >> of the predicate will be active and the rest will be inactive. >> If DATA_MODE is an SVE mode, every bit of the predicate will be active. */ >> static rtx >> aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode) >> { >> if (aarch64_sve_mode_p (data_mode)) >> return aarch64_ptrue_reg (pred_mode); >> >> auto size = GET_MODE_SIZE (data_mode).to_constant (); >> return aarch64_ptrue_reg (pred_mode, size); >> } >> >> Thanks, >> Richard >> >>> } >>> ) >>> >>> ;; Predicated ASRD. >>> (define_insn "*sdiv_pow2<mode>3" >>> - [(set (match_operand:SVE_I 0 "register_operand") >>> - (unspec:SVE_I >>> + [(set (match_operand:SVE_VDQ_I 0 "register_operand") >>> + (unspec:SVE_VDQ_I >>> [(match_operand:<VPRED> 1 "register_operand") >>> - (unspec:SVE_I >>> - [(match_operand:SVE_I 2 "register_operand") >>> - (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")] >>> + (unspec:SVE_VDQ_I >>> + [(match_operand:SVE_VDQ_I 2 "register_operand") >>> + (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")] >>> UNSPEC_ASRD)] >>> UNSPEC_PRED_X))] >>> "TARGET_SVE" >>> {@ [ cons: =0 , 1 , 2 ; attrs: movprfx ] >>> - [ w , Upl , 0 ; * ] asrd\t%0.<Vetype>, %1/m, >>> %0.<Vetype>, #%3 >>> - [ ?&w , Upl , w ; yes ] movprfx\t%0, >>> %2\;asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3 >>> + [ w , Upl , 0 ; * ] asrd\t%Z0.<Vetype>, %1/m, >>> %Z0.<Vetype>, #%3 >>> + [ ?&w , Upl , w ; yes ] movprfx\t%Z0, >>> %Z2\;asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3 >>> } >>> ) > > > From 2f5df569bd71f87cf7ec580f8511eff3b771dcfb Mon Sep 17 00:00:00 2001 > From: Soumya AR <soum...@nvidia.com> > Date: Tue, 10 Dec 2024 10:37:11 +0530 > Subject: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes. > > The ASRD instruction on SVE performs an arithmetic shift right by an immediate > for divide. > > This patch enables the use of ASRD with Neon modes. > > For example: > > int in[N], out[N]; > > void > foo (void) > { > for (int i = 0; i < N; i++) > out[i] = in[i] / 4; > } > > compiles to: > > ldr q31, [x1, x0] > cmlt v30.16b, v31.16b, #0 > and z30.b, z30.b, 3 > add v30.16b, v30.16b, v31.16b > sshr v30.16b, v30.16b, 2 > str q30, [x0, x2] > add x0, x0, 16 > cmp x0, 1024 > > but can just be: > > ldp q30, q31, [x0], 32 > asrd z31.b, p7/m, z31.b, #2 > asrd z30.b, p7/m, z30.b, #2 > stp q30, q31, [x1], 32 > cmp x0, x2 > > This patch also adds the following overload: > aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode) > Depending on the data mode, the function returns a predicate with the > appropriate bits set. > > The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. > OK for mainline? > > gcc/ChangeLog: > > * config/aarch64/aarch64.cc (aarch64_ptrue_reg): New overload. > * config/aarch64/aarch64-protos.h (aarch64_ptrue_reg): Likewise. > * config/aarch64/aarch64-sve.md: Extended sdiv_pow2<mode>3 and > *sdiv_pow2<mode>3 to support Neon modes. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/sve/sve-asrd.c: New test. > > Co-authored-by: Richard Sandiford <richard.sandif...@arm.com> > Signed-off-by: Soumya AR <soum...@nvidia.com> > --- > gcc/config/aarch64/aarch64-protos.h | 1 + > gcc/config/aarch64/aarch64-sve.md | 24 +++--- > gcc/config/aarch64/aarch64.cc | 16 ++++ > .../gcc.target/aarch64/sve/sve-asrd.c | 86 +++++++++++++++++++ > 4 files changed, 115 insertions(+), 12 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c > > diff --git a/gcc/config/aarch64/aarch64-protos.h > b/gcc/config/aarch64/aarch64-protos.h > index 72df4a575b3..554066d4cda 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -1017,6 +1017,7 @@ void aarch64_expand_mov_immediate (rtx, rtx); > rtx aarch64_stack_protect_canary_mem (machine_mode, rtx, aarch64_salt_type); > rtx aarch64_ptrue_reg (machine_mode); > rtx aarch64_ptrue_reg (machine_mode, unsigned int); > +rtx aarch64_ptrue_reg (machine_mode, machine_mode); > rtx aarch64_pfalse_reg (machine_mode); > bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *); > void aarch64_emit_sve_pred_move (rtx, rtx, rtx); > diff --git a/gcc/config/aarch64/aarch64-sve.md > b/gcc/config/aarch64/aarch64-sve.md > index affdb24a93d..494c2208ffc 100644 > --- a/gcc/config/aarch64/aarch64-sve.md > +++ b/gcc/config/aarch64/aarch64-sve.md > @@ -4972,34 +4972,34 @@ > > ;; Unpredicated ASRD. > (define_expand "sdiv_pow2<mode>3" > - [(set (match_operand:SVE_I 0 "register_operand") > - (unspec:SVE_I > + [(set (match_operand:SVE_VDQ_I 0 "register_operand") > + (unspec:SVE_VDQ_I > [(match_dup 3) > - (unspec:SVE_I > - [(match_operand:SVE_I 1 "register_operand") > + (unspec:SVE_VDQ_I > + [(match_operand:SVE_VDQ_I 1 "register_operand") > (match_operand 2 "aarch64_simd_rshift_imm")] > UNSPEC_ASRD)] > UNSPEC_PRED_X))] > "TARGET_SVE" > { > - operands[3] = aarch64_ptrue_reg (<VPRED>mode); > + operands[3] = aarch64_ptrue_reg (<VPRED>mode, <MODE>mode); > } > ) > > ;; Predicated ASRD. > (define_insn "*sdiv_pow2<mode>3" > - [(set (match_operand:SVE_I 0 "register_operand") > - (unspec:SVE_I > + [(set (match_operand:SVE_VDQ_I 0 "register_operand") > + (unspec:SVE_VDQ_I > [(match_operand:<VPRED> 1 "register_operand") > - (unspec:SVE_I > - [(match_operand:SVE_I 2 "register_operand") > - (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")] > + (unspec:SVE_VDQ_I > + [(match_operand:SVE_VDQ_I 2 "register_operand") > + (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")] > UNSPEC_ASRD)] > UNSPEC_PRED_X))] > "TARGET_SVE" > {@ [ cons: =0 , 1 , 2 ; attrs: movprfx ] > - [ w , Upl , 0 ; * ] asrd\t%0.<Vetype>, %1/m, > %0.<Vetype>, #%3 > - [ ?&w , Upl , w ; yes ] movprfx\t%0, > %2\;asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3 > + [ w , Upl , 0 ; * ] asrd\t%Z0.<Vetype>, %1/m, > %Z0.<Vetype>, #%3 > + [ ?&w , Upl , w ; yes ] movprfx\t%Z0, > %Z2\;asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3 > } > ) > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 00bcf18ae97..68321368da1 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -3682,6 +3682,22 @@ aarch64_ptrue_reg (machine_mode mode, unsigned int vl) > return gen_lowpart (mode, reg); > } > > +/* Return a register of mode PRED_MODE for controlling data of mode > DATA_MODE. > + > + DATA_MODE can be a scalar, an Advanced SIMD vector, or an SVE vector. > + If it's an N-byte scalar or an Advanced SIMD vector, the first N bits > + of the predicate will be active and the rest will be inactive. > + If DATA_MODE is an SVE mode, every bit of the predicate will be active. > */ > +rtx > +aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode) > +{ > + if (aarch64_sve_mode_p (data_mode)) > + return aarch64_ptrue_reg (pred_mode); > + > + auto size = GET_MODE_SIZE (data_mode).to_constant (); > + return aarch64_ptrue_reg (pred_mode, size); > +} > + > /* Return an all-false predicate register of mode MODE. */ > > rtx > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c > b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c > new file mode 100644 > index 00000000000..341baae505c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c > @@ -0,0 +1,86 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +#include <stdint.h> > + > +#define FUNC(TYPE, I) > \ > + TYPE M_##TYPE##_##I[I]; > \ > + void asrd_##TYPE##_##I () > \ > + { > \ > + for (int i = 0; i < I; i++) > \ > + { > \ > + M_##TYPE##_##I[i] /= 4; \ > + } > \ > + } > + > +/* > +** asrd_int8_t_8: > +** ... > +** ptrue (p[0-7]).b, vl8 > +** ... > +** asrd z[0-9]+\.b, \1/m, z[0-9]+\.b, #2 > +** ... > +*/ > +FUNC(int8_t, 8); > + > +/* > +** asrd_int8_t_16: > +** ... > +** ptrue (p[0-7]).b, vl16 > +** ... > +** asrd z[0-9]+\.b, \1/m, z[0-9]+\.b, #2 > +** ... > +*/ > +FUNC(int8_t, 16); > + > +/* > +** asrd_int16_t_4: > +** ... > +** ptrue (p[0-7]).b, vl8 > +** ... > +** asrd z[0-9]+\.h, \1/m, z[0-9]+\.h, #2 > +** ... > +*/ > +FUNC(int16_t, 4); > + > +/* > +** asrd_int16_t_8: > +** ... > +** ptrue (p[0-7]).b, vl16 > +** ... > +** asrd z[0-9]+\.h, \1/m, z[0-9]+\.h, #2 > +** ... > +*/ > +FUNC(int16_t, 8); > + > +/* > +** asrd_int32_t_2: > +** ... > +** ptrue (p[0-7]).b, vl8 > +** ... > +** asrd z[0-9]+\.s, \1/m, z[0-9]+\.s, #2 > +** ... > +*/ > +FUNC(int32_t, 2); > + > +/* > +** asrd_int32_t_4: > +** ... > +** ptrue (p[0-7]).b, vl16 > +** ... > +** asrd z[0-9]+\.s, \1/m, z[0-9]+\.s, #2 > +** ... > +*/ > +FUNC(int32_t, 4); > + > +/* > +** asrd_int64_t_2: > +** ... > +** ptrue (p[0-7]).b, vl16 > +** ... > +** asrd z[0-9]+\.d, \1/m, z[0-9]+\.d, #2 > +** ... > +*/ > +FUNC(int64_t, 2); > +