Soumya AR <soum...@nvidia.com> writes:
> Hi Richard,
>
> Thanks for reviewing this!
>
> I’ve made the suggested changes and added the aarch64_ptrue_reg overload.

The updated patch is OK, thanks.

> Thank you for providing the implementation for this, I have added you 
> as a co-author for the patch, hope that works :)

Sure, that's ok with me FWIW.  IMO it isn't really necessary though,
so it's ok without too :)

Sometimes it just feels easier to write review comments in the form
of code snippets, since that's more direct & precise than describing
something in English.  It is still fundamentally a code review though.

Richard

>
> Best,
> Soumya
>
>
>> On 5 Dec 2024, at 10:08 PM, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> 
>> Soumya AR <soum...@nvidia.com> writes:
>>> The ASRD instruction on SVE performs an arithmetic shift right by an immediate for divide.
>>> 
>>> This patch enables the use of ASRD with Neon modes.
>>> 
>>> For example:
>>> 
>>> int in[N], out[N];
>>> 
>>> void
>>> foo (void)
>>> {
>>>  for (int i = 0; i < N; i++)
>>>    out[i] = in[i] / 4;
>>> }
>>> 
>>> compiles to:
>>> 
>>>      ldr     q31, [x1, x0]
>>>      cmlt    v30.16b, v31.16b, #0
>>>      and     z30.b, z30.b, 3
>>>      add     v30.16b, v30.16b, v31.16b
>>>      sshr    v30.16b, v30.16b, 2
>>>      str     q30, [x0, x2]
>>>      add     x0, x0, 16
>>>      cmp     x0, 1024
>>> 
>>> but can just be:
>>> 
>>>      ldp     q30, q31, [x0], 32
>>>      asrd    z31.b, p7/m, z31.b, #2
>>>      asrd    z30.b, p7/m, z30.b, #2
>>>      stp     q30, q31, [x1], 32
>>>      cmp     x0, x2
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Soumya AR <soum...@nvidia.com>
>>> 
>>> gcc/ChangeLog:
>>> 
>>>      * config/aarch64/aarch64-sve.md: Extended sdiv_pow2<mode>3 and
>>>      *sdiv_pow2<mode>3 to support Neon modes.
>>> 
>>> gcc/testsuite/ChangeLog:
>>> 
>>>      * gcc.target/aarch64/sve/sve-asrd.c: New test.
>>> ---
>>> gcc/config/aarch64/aarch64-sve.md             | 25 ++++-----
>>> .../gcc.target/aarch64/sve/sve-asrd.c         | 54 +++++++++++++++++++
>>> 2 files changed, 67 insertions(+), 12 deletions(-)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>> 
>> Rearranging to put the testcase first:
>> 
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>>> new file mode 100644
>>> index 00000000000..00aa8b2380d
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>>> @@ -0,0 +1,54 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */
>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>> +
>>> +#include <stdint.h>
>>> +#define N 1024
>>> +
>>> +#define FUNC(M)                     \
>>> +M in_##M[N];                        \
>>> +M out_##M[N];                       \
>>> +void asrd_##M() {                   \
>>> +  for (int i = 0; i < N; i++)       \
>>> +    out_##M[i] = in_##M[i] / 4;     \
>>> +}
>>> +
>>> +/*
>>> +** asrd_int8_t:
>>> +**   ...
>>> +**   ptrue   (p[0-7]).b, vl1
>>> +**   ...
>>> +**   asrd    z[0-9]+\.b, \1/m, z[0-9]+\.b, #2
>>> +**   ...
>>> +*/
>>> +FUNC(int8_t)
>> 
>> ptrue ..., vl1 will only do the shift/division on the first element.
>> It should instead be vl8 for 64-bit modes and vl16 for 128-bit modes.
>> 
>>> +
>>> +/*
>>> +** asrd_int16_t:
>>> +**   ...
>>> +**   ptrue   (p[0-7]).b, vl2
>>> +**   ...
>>> +**   asrd    z[0-9]+\.h, \1/m, z[0-9]+\.h, #2
>>> +**   ...
>>> +*/
>>> +FUNC(int16_t)
>>> +
>>> +/*
>>> +** asrd_int32_t:
>>> +**   ...
>>> +**   ptrue   (p[0-7]).b, vl4
>>> +**   ...
>>> +**   asrd    z[0-9]+\.s, \1/m, z[0-9]+\.s, #2
>>> +**   ...
>>> +*/
>>> +FUNC(int32_t)
>>> +
>>> +/*
>>> +** asrd_int64_t:
>>> +**   ...
>>> +**   ptrue   (p[0-7]).b, vl8
>>> +**   ...
>>> +**   asrd    z[0-9]+\.d, \1/m, z[0-9]+\.d, #2
>>> +**   ...
>>> +*/
>>> +FUNC(int64_t)
>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>>> index affdb24a93d..96effe4abed 100644
>>> --- a/gcc/config/aarch64/aarch64-sve.md
>>> +++ b/gcc/config/aarch64/aarch64-sve.md
>>> @@ -4972,34 +4972,35 @@
>>> 
>>> ;; Unpredicated ASRD.
>>> (define_expand "sdiv_pow2<mode>3"
>>> -  [(set (match_operand:SVE_I 0 "register_operand")
>>> -     (unspec:SVE_I
>>> +  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
>>> +     (unspec:SVE_VDQ_I
>>>        [(match_dup 3)
>>> -        (unspec:SVE_I
>>> -          [(match_operand:SVE_I 1 "register_operand")
>>> +        (unspec:SVE_VDQ_I
>>> +          [(match_operand:SVE_VDQ_I 1 "register_operand")
>>>            (match_operand 2 "aarch64_simd_rshift_imm")]
>>>           UNSPEC_ASRD)]
>>>       UNSPEC_PRED_X))]
>>>   "TARGET_SVE"
>>>   {
>>> -    operands[3] = aarch64_ptrue_reg (<VPRED>mode);
>>> +    operands[3] = aarch64_ptrue_reg (<VPRED>mode,
>>> +                                 GET_MODE_UNIT_SIZE (<MODE>mode));
>> 
>> Perhaps we should add yet another overload of aarch64_ptrue_reg that
>> takes the data mode as a second argument.  The implementation could
>> be something like:
>> 
>> /* Return a register of mode PRED_MODE for controlling data of mode DATA_MODE.
>> 
>>   DATA_MODE can be a scalar, an Advanced SIMD vector, or an SVE vector.
>>   If it's an N-byte scalar or an Advanced SIMD vector, the first N bits
>>   of the predicate will be active and the rest will be inactive.
>>   If DATA_MODE is an SVE mode, every bit of the predicate will be active.  */
>> static rtx
>> aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode)
>> {
>>  if (aarch64_sve_mode_p (data_mode))
>>    return aarch64_ptrue_reg (pred_mode);
>> 
>>  auto size = GET_MODE_SIZE (data_mode).to_constant ();
>>  return aarch64_ptrue_reg (pred_mode, size);
>> }
>> 
>> Thanks,
>> Richard
>> 
>>>   }
>>> )
>>> 
>>> ;; Predicated ASRD.
>>> (define_insn "*sdiv_pow2<mode>3"
>>> -  [(set (match_operand:SVE_I 0 "register_operand")
>>> -     (unspec:SVE_I
>>> +  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
>>> +     (unspec:SVE_VDQ_I
>>>        [(match_operand:<VPRED> 1 "register_operand")
>>> -        (unspec:SVE_I
>>> -          [(match_operand:SVE_I 2 "register_operand")
>>> -           (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")]
>>> +        (unspec:SVE_VDQ_I
>>> +          [(match_operand:SVE_VDQ_I 2 "register_operand")
>>> +           (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")]
>>>           UNSPEC_ASRD)]
>>>        UNSPEC_PRED_X))]
>>>   "TARGET_SVE"
>>>   {@ [ cons: =0 , 1   , 2 ; attrs: movprfx ]
>>> -     [ w        , Upl , 0 ; *              ] asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3
>>> -     [ ?&w      , Upl , w ; yes            ] movprfx\t%0, %2\;asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3
>>> +     [ w        , Upl , 0 ; *              ] asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3
>>> +     [ ?&w      , Upl , w ; yes            ] movprfx\t%Z0, %Z2\;asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3
>>>   }
>>> )
>
>
> From 2f5df569bd71f87cf7ec580f8511eff3b771dcfb Mon Sep 17 00:00:00 2001
> From: Soumya AR <soum...@nvidia.com>
> Date: Tue, 10 Dec 2024 10:37:11 +0530
> Subject: [PATCH] aarch64: Use SVE ASRD instruction with Neon modes.
>
> The ASRD instruction on SVE performs an arithmetic shift right by an immediate
> for divide.
>
> This patch enables the use of ASRD with Neon modes.
>
> For example:
>
> int in[N], out[N];
>
> void
> foo (void)
> {
>   for (int i = 0; i < N; i++)
>     out[i] = in[i] / 4;
> }
>
> compiles to:
>
>       ldr     q31, [x1, x0]
>       cmlt    v30.16b, v31.16b, #0
>       and     z30.b, z30.b, 3
>       add     v30.16b, v30.16b, v31.16b
>       sshr    v30.16b, v30.16b, 2
>       str     q30, [x0, x2]
>       add     x0, x0, 16
>       cmp     x0, 1024
>
> but can just be:
>
>       ldp     q30, q31, [x0], 32
>       asrd    z31.b, p7/m, z31.b, #2
>       asrd    z30.b, p7/m, z30.b, #2
>       stp     q30, q31, [x1], 32
>       cmp     x0, x2
>
> This patch also adds the following overload:
>       aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode)
> Depending on the data mode, the function returns a predicate with the
> appropriate bits set.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64.cc (aarch64_ptrue_reg): New overload.
>       * config/aarch64/aarch64-protos.h (aarch64_ptrue_reg): Likewise.
>       * config/aarch64/aarch64-sve.md: Extend sdiv_pow2<mode>3 and
>       *sdiv_pow2<mode>3 to support Neon modes.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/sve/sve-asrd.c: New test.
>
> Co-authored-by: Richard Sandiford <richard.sandif...@arm.com>
> Signed-off-by: Soumya AR <soum...@nvidia.com>
> ---
>  gcc/config/aarch64/aarch64-protos.h           |  1 +
>  gcc/config/aarch64/aarch64-sve.md             | 24 +++---
>  gcc/config/aarch64/aarch64.cc                 | 16 ++++
>  .../gcc.target/aarch64/sve/sve-asrd.c         | 86 +++++++++++++++++++
>  4 files changed, 115 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 72df4a575b3..554066d4cda 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1017,6 +1017,7 @@ void aarch64_expand_mov_immediate (rtx, rtx);
>  rtx aarch64_stack_protect_canary_mem (machine_mode, rtx, aarch64_salt_type);
>  rtx aarch64_ptrue_reg (machine_mode);
>  rtx aarch64_ptrue_reg (machine_mode, unsigned int);
> +rtx aarch64_ptrue_reg (machine_mode, machine_mode);
>  rtx aarch64_pfalse_reg (machine_mode);
>  bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *);
>  void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index affdb24a93d..494c2208ffc 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -4972,34 +4972,34 @@
>  
>  ;; Unpredicated ASRD.
>  (define_expand "sdiv_pow2<mode>3"
> -  [(set (match_operand:SVE_I 0 "register_operand")
> -     (unspec:SVE_I
> +  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
> +     (unspec:SVE_VDQ_I
>         [(match_dup 3)
> -        (unspec:SVE_I
> -          [(match_operand:SVE_I 1 "register_operand")
> +        (unspec:SVE_VDQ_I
> +          [(match_operand:SVE_VDQ_I 1 "register_operand")
>             (match_operand 2 "aarch64_simd_rshift_imm")]
>            UNSPEC_ASRD)]
>        UNSPEC_PRED_X))]
>    "TARGET_SVE"
>    {
> -    operands[3] = aarch64_ptrue_reg (<VPRED>mode);
> +    operands[3] = aarch64_ptrue_reg (<VPRED>mode, <MODE>mode);
>    }
>  )
>  
>  ;; Predicated ASRD.
>  (define_insn "*sdiv_pow2<mode>3"
> -  [(set (match_operand:SVE_I 0 "register_operand")
> -     (unspec:SVE_I
> +  [(set (match_operand:SVE_VDQ_I 0 "register_operand")
> +     (unspec:SVE_VDQ_I
>         [(match_operand:<VPRED> 1 "register_operand")
> -        (unspec:SVE_I
> -          [(match_operand:SVE_I 2 "register_operand")
> -           (match_operand:SVE_I 3 "aarch64_simd_rshift_imm")]
> +        (unspec:SVE_VDQ_I
> +          [(match_operand:SVE_VDQ_I 2 "register_operand")
> +           (match_operand:SVE_VDQ_I 3 "aarch64_simd_rshift_imm")]
>            UNSPEC_ASRD)]
>         UNSPEC_PRED_X))]
>    "TARGET_SVE"
>    {@ [ cons: =0 , 1   , 2 ; attrs: movprfx ]
> -     [ w        , Upl , 0 ; *              ] asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3
> -     [ ?&w      , Upl , w ; yes            ] movprfx\t%0, %2\;asrd\t%0.<Vetype>, %1/m, %0.<Vetype>, #%3
> +     [ w        , Upl , 0 ; *              ] asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3
> +     [ ?&w      , Upl , w ; yes            ] movprfx\t%Z0, %Z2\;asrd\t%Z0.<Vetype>, %1/m, %Z0.<Vetype>, #%3
>    }
>  )
>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 00bcf18ae97..68321368da1 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3682,6 +3682,22 @@ aarch64_ptrue_reg (machine_mode mode, unsigned int vl)
>    return gen_lowpart (mode, reg);
>  }
>  
> +/* Return a register of mode PRED_MODE for controlling data of mode DATA_MODE.
> +
> +   DATA_MODE can be a scalar, an Advanced SIMD vector, or an SVE vector.
> +   If it's an N-byte scalar or an Advanced SIMD vector, the first N bits
> +   of the predicate will be active and the rest will be inactive.
> +   If DATA_MODE is an SVE mode, every bit of the predicate will be active.  */
> +rtx
> +aarch64_ptrue_reg (machine_mode pred_mode, machine_mode data_mode)
> +{
> +  if (aarch64_sve_mode_p (data_mode))
> +    return aarch64_ptrue_reg (pred_mode);
> +
> +  auto size = GET_MODE_SIZE (data_mode).to_constant ();
> +  return aarch64_ptrue_reg (pred_mode, size);
> +}
> +
>  /* Return an all-false predicate register of mode MODE.  */
>  
>  rtx
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
> new file mode 100644
> index 00000000000..341baae505c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve-asrd.c
> @@ -0,0 +1,86 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast --param aarch64-autovec-preference=asimd-only" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <stdint.h>
> +
> +#define FUNC(TYPE, I)                                                        \
> +  TYPE M_##TYPE##_##I[I];                                                    \
> +  void asrd_##TYPE##_##I ()                                                  \
> +  {                                                                          \
> +    for (int i = 0; i < I; i++)                                              \
> +      {                                                                      \
> +     M_##TYPE##_##I[i] /= 4;                                                \
> +      }                                                                      \
> +  }
> +
> +/*
> +** asrd_int8_t_8:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl8
> +**   ...
> +**   asrd    z[0-9]+\.b, \1/m, z[0-9]+\.b, #2
> +**   ...
> +*/
> +FUNC(int8_t, 8);
> +
> +/*
> +** asrd_int8_t_16:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl16
> +**   ...
> +**   asrd    z[0-9]+\.b, \1/m, z[0-9]+\.b, #2
> +**   ...
> +*/
> +FUNC(int8_t, 16);
> +
> +/*
> +** asrd_int16_t_4:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl8
> +**   ...
> +**   asrd    z[0-9]+\.h, \1/m, z[0-9]+\.h, #2
> +**   ...
> +*/
> +FUNC(int16_t, 4);
> +
> +/*
> +** asrd_int16_t_8:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl16
> +**   ...
> +**   asrd    z[0-9]+\.h, \1/m, z[0-9]+\.h, #2
> +**   ...
> +*/
> +FUNC(int16_t, 8);
> +
> +/*
> +** asrd_int32_t_2:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl8
> +**   ...
> +**   asrd    z[0-9]+\.s, \1/m, z[0-9]+\.s, #2
> +**   ...
> +*/
> +FUNC(int32_t, 2);
> +
> +/*
> +** asrd_int32_t_4:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl16
> +**   ...
> +**   asrd    z[0-9]+\.s, \1/m, z[0-9]+\.s, #2
> +**   ...
> +*/
> +FUNC(int32_t, 4);
> +
> +/*
> +** asrd_int64_t_2:
> +**   ...
> +**   ptrue   (p[0-7]).b, vl16
> +**   ...
> +**   asrd    z[0-9]+\.d, \1/m, z[0-9]+\.d, #2
> +**   ...
> +*/
> +FUNC(int64_t, 2);
> +
