Thank you for the suggestions! I’m trying them out now.
> On 24 Oct 2024, at 21:11, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi Richard,
>>
>>> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>
>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> Hi all,
>>>>
>>>> Some vector rotate operations can be implemented in a single instruction
>>>> rather than using the fallback SHL+USRA sequence.
>>>> In particular, when the rotate amount is half the bitwidth of the element
>>>> we can use a REV64, REV32 or REV16 instruction.
>>>> This patch adds this transformation to the recently added splitter for
>>>> vector rotates.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>
>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>
>>>> gcc/
>>>>
>>>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>>>> Declare prototype.
>>>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>>>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>>>> Call the above.
>>>>
>>>> gcc/testsuite/
>>>>
>>>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>>>
>>> Sorry to be awkward, but I still think at least part of this should be
>>> target-independent. Any rotate by a byte amount can be expressed as a
>>> vector permutation in a target-independent way. Target-independent code
>>> can then use the usual optab routines to query whether the permutation
>>> is possible and/or try to generate it.
>>
>> Thank you for elaborating. I had already prototyped the permute
>> index-computing code in my tree but was reluctant to use it during expand,
>> as I wanted the rotate RTX to be available for combining into XAR, so I felt
>> a bit stuck. Having the code in a generic place but called from the backend
>> at a time of its choosing makes sense to me.
>>
>>>
>>> I can see that it probably makes sense to leave target code to make
>>> the decision about when to use the permutation strategy vs. other
>>> approaches. But the code to implement that strategy shouldn't need
>>> to be target-specific.
>>>
>>> E.g. we could have a routine:
>>>
>>> expand_rotate_as_vec_perm
>>>
>>> which checks whether the rotation amount is suitable and tries to
>>> generate the permutation if so.
>>
>> I’ve implemented something like that in the attached patch.
>> It seems to work on AArch64 but as mentioned in the commit message I’d like
>> a check on the big-endian logic, and perhaps some pointers on how/whether it
>> should be extended to VLA vectors.
>
> Great! Thanks for doing this. Some comments on those aspects below,
> but otherwise it LGTM.
>
>>
>> I’m updating the other patches in the series according to your feedback, so
>> I’ll repost them once I’m done; I just wanted to get this out for further
>> iteration in the meantime.
>> Thanks,
>> Kyrill
>>
>> From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Wed, 16 Oct 2024 04:10:08 -0700
>> Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where possible
>>
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> More generally, rotates by a byte amount can be implemented using vector
>> permutes.
>> This patch adds such a generic routine in expmed.cc called
>> expand_rotate_as_vec_perm that calculates the required permute indices
>> and uses the expand_vec_perm_const interface.
>>
>> On aarch64 this ends up generating the single-instruction sequences above
>> where possible and can use LDR+TBL sequences too, which are a good choice.
>>
>> For now, I have restricted expand_rotate_as_vec_perm to fixed-width modes
>> as I don't have much experience with using it for VLA modes, but I imagine
>> it's extendable there. In any case, the only use of expand_rotate_as_vec_perm
>> is in aarch64-specific code that for now only handles fixed-width modes.
>>
>> A runtime aarch64 test is added to ensure the permute indices are not messed up.
>> I'd appreciate a review of the BYTES_BIG_ENDIAN logic. I've adjusted the
>> permute vector indices in RTL for it, and in the final AArch64 assembly the
>> vector loaded from .LC<N> is identical for little and big-endian, which I
>> *think* is the correct behaviour. For rotates by half the element width,
>> which should generate single REV64 or REV32 instructions, aarch64 does not
>> seem to recognise them and falls back to LDR+TBL for big-endian. I'm not
>> sure if that's simply missing logic in the vector permute expansion in
>> aarch64 or if I should be passing a different permute vector. If I delete
>> the BYTES_BIG_ENDIAN hunk in expand_rotate_as_vec_perm the permute indices
>> are reversed, but the REV* instructions are still not recognised.
>
> Hmm, yeah, sounds like it might be missing logic, like you say.
>
>> I don't have an easily accessible aarch64_be setup so I'd appreciate it if
>> someone could run it there; hopefully the new execute test can help uncover
>> any lane index bugs.
>
> Can't remember if the CI does aarch64_be or not.
>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> * expmed.h (expand_rotate_as_vec_perm): Declare.
>> * expmed.cc (expand_rotate_as_vec_perm): Define.
>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> Declare prototype.
>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> Call the above.
>>
>> gcc/testsuite/
>>
>> * gcc.target/aarch64/vec-rot-exec.c: New test.
>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>> ---
>> gcc/config/aarch64/aarch64-protos.h | 1 +
>> gcc/config/aarch64/aarch64-simd.md | 3 +
>> gcc/config/aarch64/aarch64.cc | 16 +++
>> gcc/expmed.cc | 42 ++++++++
>> gcc/expmed.h | 1 +
>> .../gcc.target/aarch64/simd/pr117048_2.c | 66 ++++++++++++
>> .../gcc.target/aarch64/vec-rot-exec.c | 101 ++++++++++++++++++
>> 7 files changed, 230 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> index 75f30a52e61..fac7f9101a3 100644
>> --- a/gcc/config/aarch64/aarch64-protos.h
>> +++ b/gcc/config/aarch64/aarch64-protos.h
>> @@ -766,6 +766,7 @@ bool aarch64_rnd_imm_p (rtx);
>> bool aarch64_constant_address_p (rtx);
>> bool aarch64_emit_approx_div (rtx, rtx, rtx);
>> bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>> +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
>> tree aarch64_vector_load_decl (tree);
>> rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
>> void aarch64_expand_call (rtx, rtx, rtx, bool);
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index 6f69ccb6eba..b0aa6e0aded 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -1313,6 +1313,9 @@
>> (match_dup 4))
>> (match_dup 3)))]
>> {
>> + if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
>> + DONE;
>> +
>> operands[3] = reload_completed ? operands[0] : gen_reg_rtx (<MODE>mode);
>> rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
>> int bitwidth = GET_MODE_UNIT_SIZE (<MODE>mode) * BITS_PER_UNIT;
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 7fbe3a7380c..5b64978daba 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -16027,6 +16027,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
>> return true;
>> }
>>
>> +/* Emit an optimized sequence to perform a vector rotate
>> + of REG by the vector constant amount AMNT and place the result
>> + in DST. Return true iff successful. */
>> +
>> +bool
>> +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
>> +{
>> + machine_mode mode = GET_MODE (reg);
>> + /* Attempt to expand the rotate as a vector permute.
>> + For some rotate amounts they can be single instructions and
>> + even the general single-vector TBL permute has good throughput. */
>> + if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
>> + return true;
>> + return false;
>> +}
>> +
>> /* Return the number of instructions that can be issued per cycle. */
>> static int
>> aarch64_sched_issue_rate (void)
>> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> index 4498f9f38a8..1d7d6a7a914 100644
>> --- a/gcc/expmed.cc
>> +++ b/gcc/expmed.cc
>> @@ -6286,6 +6286,48 @@ emit_store_flag_force (rtx target, enum rtx_code code, rtx op0, rtx op1,
>> return target;
>> }
>>
>> +/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector
>> + permute operation. Emit code to put the result in DST if successful and
>> + return it. Otherwise return NULL. This is intended to implement vector
>> + rotates by byte amounts using vector permutes when the target does not offer
>> + native vector rotate operations. */
>> +rtx
>> +expand_rotate_as_vec_perm (machine_mode mode, rtx dst, rtx x, rtx amt)
>> +{
>> + int rotamnt = INTVAL (unwrap_const_vec_duplicate (amt));
>
> It's probably worth making this even more general by checking for
> CONST_INT_P on the result of unwrap_const_vec_duplicate, to cope
> gracefully with variable amounts. (Or poly_int amounts :))
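Agreed, that's a nice generalisation. What I'm trying locally is roughly the
following (untested sketch; amt_cst is just a placeholder local):

  rtx amt_cst = unwrap_const_vec_duplicate (amt);
  /* Bail out gracefully on variable (or poly_int) rotate amounts.  */
  if (!CONST_INT_P (amt_cst))
    return NULL_RTX;
  int rotamnt = INTVAL (amt_cst);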
>
>> + if (rotamnt % BITS_PER_UNIT != 0)
>> + return NULL_RTX;
>> + machine_mode qimode;
>> + if (!qimode_for_vec_perm (mode).exists (&qimode))
>> + return NULL_RTX;
>> +
>> + vec_perm_builder builder;
>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>
> simpler as GET_MODE_UNIT_SIZE
>
>> + unsigned total_units;
>> + /* TODO: Handle VLA vector rotates? */
>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>> + return NULL_RTX;
>
> Yeah. I think we can do that by changing:
>
>> + builder.new_vector (total_units, 1, total_units);
>
> to:
>
> builder.new_vector (total_units, 3, units);
I think units here is the size in units (bytes) of the fixed-width component of
the mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VNx4HI?
What is the recommended API for getting that number out of the poly_uint64 mode
size? Is it just accessing coeffs[0]?
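I.e. is it just something like this (a placeholder sketch, assuming coeffs[0]
is meant to be accessed directly):

  /* Fixed-width component, in bytes, of the possibly variable-length size.  */
  poly_uint64 total_size = GET_MODE_SIZE (mode);
  unsigned units = total_size.coeffs[0];  /* 16 for V4SI/VNx4SI, 8 for V4HI.  */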
>
> unconditionally and making the outer loop below iterate exactly
> three times (i.e. to nunits * 3). It's ok if we generate more
> indices than needed.
>
>> + int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>> + for (unsigned j = 0; j < total_units; j += nunits)
>> + for (unsigned i = 0; i < nunits; i++)
>> + {
>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>> + if (BYTES_BIG_ENDIAN)
>> + idx = total_units - idx - 1;
>
> I think the endian adjustment should be local to the inner loop/vector
> element only. Since this would mean undoing the "nunits - " adjustment
> above, how about something like:
>
> unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
> unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
> ...
> builder.quick_push ((rot_to_perm + i) % nunits + j);
>
> or whatever variation you prefer.
>
> Hope I've got that right...
Hmm, I’m getting some test failures and wrong indices when I try this.
I think I can work out the indices and the loops for them but I’d like to work
through an example. So say we are rotating a V4SImode vector by 16 (a REV32
instruction).
The indices pushed into the byte permute vector with my original patch are:
[2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
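For reference, those come from the little-endian path of my original loop:
rotamnt = 16 and nunits = 4 give rot_to_perm = 4 - 16/8 = 2, so it pushes
(2 + i) % 4 + j for each element, i.e. roughly:

  for (unsigned j = 0; j < 16; j += 4)       /* one group per SI element */
    for (unsigned i = 0; i < 4; i++)
      builder.quick_push ((2 + i) % 4 + j);  /* 2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d */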
What sequence do we want to push for V4SImode now that we have 3 patterns in
vector_builder?
Is it repeating the above 3 times or is it interleaving each SImode entry
somehow?
Thanks,
Kyrill
>
> OK with those changes for the target-independent parts (and the
> target-specific parts LGTM too FWIW).
>
> Thanks,
> Richard
>
>> + builder.quick_push (idx);
>> + }
>> + rtx perm_src = lowpart_subreg (qimode, x, mode);
>> + rtx perm_dst = lowpart_subreg (qimode, dst, mode);
>> + rtx res
>> + = expand_vec_perm_const (qimode, perm_src, perm_src, builder,
>> + qimode, perm_dst);
>> + if (!res)
>> + return NULL_RTX;
>> + emit_move_insn (dst, lowpart_subreg (mode, res, qimode));
>> + return dst;
>> +}
>> +
>> /* Helper function for canonicalize_cmp_for_target. Swap between inclusive
>> and exclusive ranges in order to create an equivalent comparison. See
>> canonicalize_cmp_for_target for the possible cases. */
>> diff --git a/gcc/expmed.h b/gcc/expmed.h
>> index 0a19176b77a..2414c3cb27d 100644
>> --- a/gcc/expmed.h
>> +++ b/gcc/expmed.h
>> @@ -726,5 +726,6 @@ extern rtx expand_mult_highpart_adjust (scalar_int_mode,
>> rtx, rtx, rtx,
>> rtx, int);
>> extern rtx expmed_mult_highpart_optab (scalar_int_mode, rtx, rtx, rtx,
>> int, int);
>> +extern rtx expand_rotate_as_vec_perm (machine_mode, rtx, rtx, rtx);
>>
>> #endif // EXPMED_H
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> new file mode 100644
>> index 00000000000..7baf3581870
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> @@ -0,0 +1,66 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mlittle-endian" } */
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +
>> +/*
>> +** G1:
>> +** rev64 v0\.4s, v0\.4s
>> +** ret
>> +*/
>> +v2di
>> +G1 (v2di r)
>> +{
>> + return (r >> 32) | (r << 32);
>> +}
>> +
>> +/*
>> +** G2:
>> +** rev32 v0\.8h, v0\.8h
>> +** ret
>> +*/
>> +v4si
>> +G2 (v4si r)
>> +{
>> + return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G3:
>> +** rev16 v0\.16b, v0\.16b
>> +** ret
>> +*/
>> +v8hi
>> +G3 (v8hi r)
>> +{
>> + return (r >> 8) | (r << 8);
>> +}
>> +
>> +/*
>> +** G4:
>> +** rev32 v0\.4h, v0\.4h
>> +** ret
>> +*/
>> +v2si
>> +G4 (v2si r)
>> +{
>> + return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G5:
>> +** rev16 v0\.8b, v0\.8b
>> +** ret
>> +*/
>> +v4hi
>> +G5 (v4hi r)
>> +{
>> + return (r >> 8) | (r << 8);
>> +}
>> +
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> new file mode 100644
>> index 00000000000..130a9b1aa64
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> @@ -0,0 +1,101 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O2" } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef char __attribute__ ((vector_size (8))) v8qi;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +#define VEC_ELTS(X) (sizeof (X) / (sizeof (X[0])))
>> +
>> +static const char __attribute__ ((aligned (16))) *str = "abcdefghijklmnopqrstuvwxyz";
>> +
>> +unsigned long long
>> +__attribute__((noipa,noinline))
>> +rot_64_one (unsigned long long x, unsigned amt)
>> +{
>> + return (x << amt) | (x >> (64 - amt));
>> +}
>> +unsigned int
>> +__attribute__((noipa,noinline))
>> +rot_32_one (unsigned int x, unsigned amt)
>> +{
>> + return (x << amt) | (x >> (32 - amt));
>> +}
>> +
>> +unsigned short
>> +__attribute__((noipa,noinline))
>> +rot_16_one (unsigned short x, unsigned short amt)
>> +{
>> + return (x << amt) | (x >> (16 - amt));
>> +}
>> +
>> +
>> +#define ROTFUNC(M,S,A) \
>> +M \
>> +__attribute__((noipa,noinline)) \
>> +rot_##M##_##S##_##A (M x) \
>> +{ \
>> + return (x << A) | (x >> (S - A)); \
>> +} \
>> + \
>> +void \
>> +test_rot_##M##_##S##_##A (void) \
>> +{ \
>> + M vec = *(M *)str; \
>> + M res = rot_##M##_##S##_##A (vec); \
>> + for (__SIZE_TYPE__ i = 0; i < VEC_ELTS (vec); i++) \
>> + if (res[i] != rot_##S##_one (vec[i], A)) \
>> + __builtin_abort (); \
>> +}
>> +
>> +ROTFUNC (v2di, 64, 56)
>> +ROTFUNC (v2di, 64, 48)
>> +ROTFUNC (v2di, 64, 40)
>> +ROTFUNC (v2di, 64, 32)
>> +ROTFUNC (v2di, 64, 24)
>> +ROTFUNC (v2di, 64, 16)
>> +ROTFUNC (v2di, 64, 8)
>> +
>> +ROTFUNC (v4si, 32, 24)
>> +ROTFUNC (v4si, 32, 16)
>> +ROTFUNC (v4si, 32, 8)
>> +
>> +ROTFUNC (v8hi, 16, 8)
>> +
>> +ROTFUNC (v2si, 32, 24)
>> +ROTFUNC (v2si, 32, 16)
>> +ROTFUNC (v2si, 32, 8)
>> +
>> +ROTFUNC (v4hi, 16, 8)
>> +
>> +#define CALL_TEST(M,S,A) test_rot_##M##_##S##_##A ()
>> +
>> +int
>> +main (void)
>> +{
>> + CALL_TEST (v2di, 64, 56);
>> + CALL_TEST (v2di, 64, 48);
>> + CALL_TEST (v2di, 64, 40);
>> + CALL_TEST (v2di, 64, 32);
>> + CALL_TEST (v2di, 64, 24);
>> + CALL_TEST (v2di, 64, 16);
>> + CALL_TEST (v2di, 64, 8);
>> +
>> + CALL_TEST (v4si, 32, 24);
>> + CALL_TEST (v4si, 32, 16);
>> + CALL_TEST (v4si, 32, 8);
>> +
>> + CALL_TEST (v8hi, 16, 8);
>> +
>> + CALL_TEST (v2si, 32, 24);
>> + CALL_TEST (v2si, 32, 16);
>> + CALL_TEST (v2si, 32, 8);
>> +
>> + CALL_TEST (v4hi, 16, 8);
>> +
>> + return 0;
>> +}
>> +