Dhruv Chawla <dhr...@nvidia.com> writes:
> On 06/05/25 21:57, Richard Sandiford wrote:
>> 
>> Dhruv Chawla <dhr...@nvidia.com> writes:
>>> This patch modifies the intrinsic expanders to expand svlsl and svlsr to
>>> unpredicated forms when the predicate is a ptrue. It also folds the
>>> following pattern:
>>>
>>>    lsl <y>, <x>, <shift>
>>>    lsr <z>, <x>, <shift>
>>>    orr <r>, <y>, <z>
>>>
>>> to:
>>>
>>>    revb/h/w <r>, <x>
>>>
>>> when the shift amount is equal to half the bitwidth of the <x>
>>> register.
>>>
>>> Bootstrapped and regtested on aarch64-linux-gnu.
>>>
>>> Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>
>>>
>>> gcc/ChangeLog:
>>>
>>>        * config/aarch64/aarch64-sve-builtins-base.cc
>>>        (svlsl_impl::expand): Define.
>>>        (svlsr_impl): New class.
>>>        (svlsr_impl::fold): Define.
>>>        (svlsr_impl::expand): Likewise.
>>>        * config/aarch64/aarch64-sve.md
>>>        (*v_rev<mode>): New pattern.
>>>        (*v_revvnx8hi): Likewise.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>>        * gcc.target/aarch64/sve/shift_rev_1.c: New test.
>>>        * gcc.target/aarch64/sve/shift_rev_2.c: Likewise.
>>> ---
>>>   .../aarch64/aarch64-sve-builtins-base.cc      | 33 +++++++-
>>>   gcc/config/aarch64/aarch64-sve.md             | 49 +++++++++++
>>>   .../gcc.target/aarch64/sve/shift_rev_1.c      | 83 +++++++++++++++++++
>>>   .../gcc.target/aarch64/sve/shift_rev_2.c      | 63 ++++++++++++++
>>>   4 files changed, 227 insertions(+), 1 deletion(-)
>>>   create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/shift_rev_1.c
>>>   create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/shift_rev_2.c
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>> index 927c5bbae21..938d010e11b 100644
>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
>>> @@ -2086,6 +2086,37 @@ public:
>>>     {
>>>       return f.fold_const_binary (LSHIFT_EXPR);
>>>     }
>>> +
>>> +  rtx expand (function_expander &e) const override
>>> +  {
>>> +    tree pred = TREE_OPERAND (e.call_expr, 3);
>>> +    tree shift = TREE_OPERAND (e.call_expr, 5);
>>> +    if (is_ptrue (pred, GET_MODE_UNIT_SIZE (e.result_mode ()))
>>> +     && uniform_integer_cst_p (shift))
>>> +      return e.use_unpred_insn (e.direct_optab_handler (ashl_optab));
>>> +    return rtx_code_function::expand (e);
>>> +  }
>>> +};
>>> +
>>> +class svlsr_impl : public rtx_code_function
>>> +{
>>> +public:
>>> +  CONSTEXPR svlsr_impl () : rtx_code_function (LSHIFTRT, LSHIFTRT) {}
>>> +
>>> +  gimple *fold (gimple_folder &f) const override
>>> +  {
>>> +    return f.fold_const_binary (RSHIFT_EXPR);
>>> +  }
>>> +
>>> +  rtx expand (function_expander &e) const override
>>> +  {
>>> +    tree pred = TREE_OPERAND (e.call_expr, 3);
>>> +    tree shift = TREE_OPERAND (e.call_expr, 5);
>>> +    if (is_ptrue (pred, GET_MODE_UNIT_SIZE (e.result_mode ()))
>>> +     && uniform_integer_cst_p (shift))
>>> +      return e.use_unpred_insn (e.direct_optab_handler (lshr_optab));
>>> +    return rtx_code_function::expand (e);
>>> +  }
>>>   };
>>>
>>>   class svmad_impl : public function_base
>>> @@ -3572,7 +3603,7 @@ FUNCTION (svldnt1, svldnt1_impl,)
>>>   FUNCTION (svlen, svlen_impl,)
>>>   FUNCTION (svlsl, svlsl_impl,)
>>>   FUNCTION (svlsl_wide, shift_wide, (ASHIFT, UNSPEC_ASHIFT_WIDE))
>>> -FUNCTION (svlsr, rtx_code_function, (LSHIFTRT, LSHIFTRT))
>>> +FUNCTION (svlsr, svlsr_impl, )
>>>   FUNCTION (svlsr_wide, shift_wide, (LSHIFTRT, UNSPEC_LSHIFTRT_WIDE))
>>>   FUNCTION (svmad, svmad_impl,)
>>>   FUNCTION (svmax, rtx_code_function, (SMAX, UMAX, UNSPEC_COND_FMAX,
>> 
>> I'm hoping that this won't be necessary after the changes I mentioned
>> in patch 1.  The expander should handle everything itself.
>
> Hi,
>
> Unfortunately this still turned out to be required - removing the changes to 
> the expander would cause a call to @aarch64_pred_<optab><mode> which would 
> bypass the whole v<optab><mode>3 pattern.

I think we should make @aarch64_pred_<optab><mode> drop the predicate
for constant shifts, just like v<optab><mode>3 avoids creating a predicate
for constant shifts.  Also, since the idea of patch 1 is to "lower" constant
shifts to the unpredicated form as soon as possible, rather than after
reload, the split in @aarch64_pred_<optab><mode> should no longer depend
on reload_completed.
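
(Roughly, the intent is that a constant shift then expands directly to the
unpredicated form, something like:

  (set (reg:VNx2DI x)
       (ashift:VNx2DI (reg:VNx2DI y)
                      (const_vector:VNx2DI [32 ...])))

instead of being wrapped in UNSPEC_PRED_X, so that combine gets to see the
plain shifts and the orr before reload and can match the new rev pattern.
The RTL above is only illustrative, not an actual dump.)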

And: my bad, but I realised after sending the review for the first patch
that v<optab><mode>3 can just test CONSTANT_P, since the predicate
already enforces the range.

Putting that together, the tests seem to work for me with the patch below,
where the aarch64-sve.md change would be part of patch 1.
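
As a quick sanity check, with the combined changes I'd expect a constant
shift under a ptrue, e.g.:

  svuint64_t
  lsl_imm (svuint64_t x)
  {
    return svlsl_n_u64_z (svptrue_b64 (), x, 3);
  }

to compile to a single unpredicated "lsl z0.d, z0.d, #3", with no ptrue
being materialised.  (That's just my expectation, not a test from the
series.)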

> [...]
>> I can't remember if we discussed this before, but I think we could extend
>> this to partial vector modes (rather than SVE_FULL...) by using the
>> GET_MODE_UNIT_SIZE of the container mode.
>
> Hmm, I am not sure how this would work. This could be a follow-up patch.

Sure, that's fine.

> [...]
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/shift_rev_1.c b/gcc/testsuite/gcc.target/aarch64/sve/shift_rev_1.c
>>> new file mode 100644
>>> index 00000000000..3a30f80d152
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/shift_rev_1.c
>>> @@ -0,0 +1,83 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O3 -march=armv8.2-a+sve" } */
>>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>> +
>>> +#include <arm_sve.h>
>>> +
>>> +/*
>>> +** ror32_sve_lsl_imm:
>>> +**   ptrue   p3.b, all
>>> +**   revw    z0.d, p3/m, z0.d
>>> +**   ret
>>> +*/
>>> +svuint64_t
>>> +ror32_sve_lsl_imm (svuint64_t r)
>>> +{
>>> +  return svorr_u64_z (svptrue_b64 (), svlsl_n_u64_z (svptrue_b64 (), r, 32),
>>> +                   svlsr_n_u64_z (svptrue_b64 (), r, 32));
>>> +}
>>> +
>>> +/*
>>> +** ror32_sve_lsl_operand:
>>> +**   ptrue   p3.b, all
>>> +**   revw    z0.d, p3/m, z0.d
>>> +**   ret
>>> +*/
>>> +svuint64_t
>>> +ror32_sve_lsl_operand (svuint64_t r)
>>> +{
>>> +  svbool_t pt = svptrue_b64 ();
>>> +  return svorr_u64_z (pt, svlsl_n_u64_z (pt, r, 32), svlsr_n_u64_z (pt, r, 32));
>>> +}
>>> +
>>> +/*
>>> +** ror16_sve_lsl_imm:
>>> +**   ptrue   p3.b, all
>>> +**   revh    z0.s, p3/m, z0.s
>>> +**   ret
>>> +*/
>>> +svuint32_t
>>> +ror16_sve_lsl_imm (svuint32_t r)
>>> +{
>>> +  return svorr_u32_z (svptrue_b32 (), svlsl_n_u32_z (svptrue_b32 (), r, 16),
>>> +                   svlsr_n_u32_z (svptrue_b32 (), r, 16));
>>> +}
>>> +
>>> +/*
>>> +** ror16_sve_lsl_operand:
>>> +**   ptrue   p3.b, all
>>> +**   revh    z0.s, p3/m, z0.s
>>> +**   ret
>>> +*/
>>> +svuint32_t
>>> +ror16_sve_lsl_operand (svuint32_t r)
>>> +{
>>> +  svbool_t pt = svptrue_b32 ();
>>> +  return svorr_u32_z (pt, svlsl_n_u32_z (pt, r, 16), svlsr_n_u32_z (pt, r, 16));
>>> +}
>>> +
>>> +/*
>>> +** ror8_sve_lsl_imm:
>>> +**   ptrue   p3.b, all
>>> +**   revb    z0.h, p3/m, z0.h
>>> +**   ret
>>> +*/
>>> +svuint16_t
>>> +ror8_sve_lsl_imm (svuint16_t r)
>>> +{
>>> +  return svorr_u16_z (svptrue_b16 (), svlsl_n_u16_z (svptrue_b16 (), r, 8),
>>> +                   svlsr_n_u16_z (svptrue_b16 (), r, 8));
>>> +}
>>> +
>>> +/*
>>> +** ror8_sve_lsl_operand:
>>> +**   ptrue   p3.b, all
>>> +**   revb    z0.h, p3/m, z0.h
>>> +**   ret
>>> +*/
>>> +svuint16_t
>>> +ror8_sve_lsl_operand (svuint16_t r)
>>> +{
>>> +  svbool_t pt = svptrue_b16 ();
>>> +  return svorr_u16_z (pt, svlsl_n_u16_z (pt, r, 8), svlsr_n_u16_z (pt, r, 8));
>>> +}
>> 
>> This only tests the revb/h/w case, not the lsl/usra fallback in the
>> split pattern.
>> 
>> The lsl/usra fallback generates pretty awful code if the rotated value
>> is passed in z0, so it might be worth adding an initial dummy argument
>> for that case.
>
> Would it be a good idea to add tests for the bad codegen as well? I have 
> added tests for lsl/usra in the next round of patches.

Nah, I don't think it's worth testing for something that we don't want.
If we can agree on what the "good" code would look like, it might be
worth adding that with an xfail, so that we notice when we start to get
the good codegen.  But IMO it's also fine to test only the dummy-argument
case, as in the new patch.
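
For the record, IIRC check-function-bodies accepts a target/xfail selector
after the function name, so the xfail version could look something like
(the function name and body below are placeholders rather than a proposal
for what the ideal sequence actually is):

/*
** ror32_sve_lsl_imm: { xfail *-*-* }
**	lsl	z0.d, z1.d, #32
**	usra	z0.d, z1.d, #32
**	ret
*/

which would then show up as an XPASS once the codegen improves.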

On the new patch:

> +/* { dg-final { scan-assembler-times "revb" 0 } } */
> +/* { dg-final { scan-assembler-times "revh" 0 } } */
> +/* { dg-final { scan-assembler-times "revw" 0 } } */

It would be better to add \ts around the mnemonics, so that we don't
accidentally match pieces of filenames that happen to contain "revb":

/* { dg-final { scan-assembler-times "\trevb\t" 0 } } */
/* { dg-final { scan-assembler-times "\trevh\t" 0 } } */
/* { dg-final { scan-assembler-times "\trevw\t" 0 } } */

Thanks,
Richard


diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 90dd5c97a10..b4396837c24 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2086,37 +2086,6 @@ public:
   {
     return f.fold_const_binary (LSHIFT_EXPR);
   }
-
-  rtx expand (function_expander &e) const override
-  {
-    tree pred = TREE_OPERAND (e.call_expr, 3);
-    tree shift = TREE_OPERAND (e.call_expr, 5);
-    if (is_ptrue (pred, GET_MODE_UNIT_SIZE (e.result_mode ()))
-       && uniform_integer_cst_p (shift))
-      return e.use_unpred_insn (e.direct_optab_handler (ashl_optab));
-    return rtx_code_function::expand (e);
-  }
-};
-
-class svlsr_impl : public rtx_code_function
-{
-public:
-  CONSTEXPR svlsr_impl () : rtx_code_function (LSHIFTRT, LSHIFTRT) {}
-
-  gimple *fold (gimple_folder &f) const override
-  {
-    return f.fold_const_binary (RSHIFT_EXPR);
-  }
-
-  rtx expand (function_expander &e) const override
-  {
-    tree pred = TREE_OPERAND (e.call_expr, 3);
-    tree shift = TREE_OPERAND (e.call_expr, 5);
-    if (is_ptrue (pred, GET_MODE_UNIT_SIZE (e.result_mode ()))
-       && uniform_integer_cst_p (shift))
-      return e.use_unpred_insn (e.direct_optab_handler (lshr_optab));
-    return rtx_code_function::expand (e);
-  }
 };
 
 class svmad_impl : public function_base
@@ -3617,7 +3586,7 @@ FUNCTION (svldnt1, svldnt1_impl,)
 FUNCTION (svlen, svlen_impl,)
 FUNCTION (svlsl, svlsl_impl,)
 FUNCTION (svlsl_wide, shift_wide, (ASHIFT, UNSPEC_ASHIFT_WIDE))
-FUNCTION (svlsr, svlsr_impl,)
+FUNCTION (svlsr, rtx_code_function, (LSHIFTRT, LSHIFTRT))
 FUNCTION (svlsr_wide, shift_wide, (LSHIFTRT, UNSPEC_LSHIFTRT_WIDE))
 FUNCTION (svmad, svmad_impl,)
 FUNCTION (svmax, rtx_code_function, (SMAX, UMAX, UNSPEC_COND_FMAX,
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 0156afc1e7d..fa431c9c060 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -4931,9 +4931,7 @@ (define_expand "<ASHIFT:optab><mode>3"
     if (CONST_INT_P (operands[2]))
       {
        amount = gen_const_vec_duplicate (<MODE>mode, operands[2]);
-       if (!aarch64_sve_<lr>shift_operand (operands[2], <MODE>mode)
-           && !aarch64_simd_shift_imm_p (operands[2], <MODE>mode,
-                                         <optab>_optab == ashl_optab))
+       if (!aarch64_sve_<lr>shift_operand (amount, <MODE>mode))
          amount = force_reg (<MODE>mode, amount);
       }
     else
@@ -4957,8 +4955,7 @@ (define_expand "v<optab><mode>3"
          UNSPEC_PRED_X))]
   "TARGET_SVE"
   {
-    if (aarch64_simd_shift_imm_p (operands[2], <MODE>mode,
-                                 <optab>_optab == ashl_optab))
+    if (CONSTANT_P (operands[2]))
       {
        emit_insn (gen_aarch64_v<optab><mode>3_const (operands[0], operands[1],
                                                      operands[2]));
@@ -4968,11 +4965,30 @@ (define_expand "v<optab><mode>3"
   }
 )
 
-;; Shift by a vector, predicated with a PTRUE.  We don't actually need
-;; the predicate for the first alternative, but using Upa or X isn't
-;; likely to gain much and would make the instruction seem less uniform
-;; to the register allocator.
-(define_insn_and_split "@aarch64_pred_<optab><mode>"
+;; Shift by a vector, predicated with a PTRUE.
+(define_expand "@aarch64_pred_<optab><mode>"
+  [(set (match_operand:SVE_I 0 "register_operand")
+       (unspec:SVE_I
+         [(match_operand:<VPRED> 1 "register_operand")
+          (ASHIFT:SVE_I
+            (match_operand:SVE_I 2 "register_operand")
+            (match_operand:SVE_I 3 "aarch64_sve_<lr>shift_operand"))]
+         UNSPEC_PRED_X))]
+  "TARGET_SVE"
+  {
+    if (CONSTANT_P (operands[3]))
+      {
+       emit_insn (gen_aarch64_v<optab><mode>3_const (operands[0], operands[2],
+                                                     operands[3]));
+       DONE;
+      }
+  }
+)
+
+;; We don't actually need the predicate for the first alternative, but
+;; using Upa or X isn't likely to gain much and would make the instruction
+;; seem less uniform to the register allocator.
+(define_insn_and_split "*aarch64_pred_<optab><mode>"
   [(set (match_operand:SVE_I 0 "register_operand")
        (unspec:SVE_I
          [(match_operand:<VPRED> 1 "register_operand")
@@ -4987,8 +5003,7 @@ (define_insn_and_split "@aarch64_pred_<optab><mode>"
     [ w        , Upl , w , 0     ; *              ] <shift>r\t%0.<Vetype>, %1/m, %3.<Vetype>, %2.<Vetype>
     [ ?&w      , Upl , w , w     ; yes            ] movprfx\t%0, %2\;<shift>\t%0.<Vetype>, %1/m, %0.<Vetype>, %3.<Vetype>
   }
-  "&& reload_completed
-   && !register_operand (operands[3], <MODE>mode)"
+  "&& !register_operand (operands[3], <MODE>mode)"
   [(set (match_dup 0) (ASHIFT:SVE_I (match_dup 2) (match_dup 3)))]
   ""
 )
