Thank you for the suggestions! I’m trying them out now.
> On 24 Oct 2024, at 21:11, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi Richard,
>>
>>> On 23 Oct 2024, at 11:30, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>>
>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> Hi all,
>>>>
>>>> Some vector rotate operations can be implemented in a single instruction
>>>> rather than using the fallback SHL+USRA sequence.
>>>> In particular, when the rotate amount is half the bitwidth of the element
>>>> we can use a REV64, REV32 or REV16 instruction.
>>>> This patch adds this transformation to the recently added splitter for
>>>> vector rotates.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>
>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>
>>>> gcc/
>>>>
>>>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>>>> Declare prototype.
>>>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>>>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>>>> Call the above.
>>>>
>>>> gcc/testsuite/
>>>>
>>>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>>>
>>> Sorry to be awkward, but I still think at least part of this should be
>>> target-independent. Any rotate by a byte amount can be expressed as a
>>> vector permutation in a target-independent way. Target-independent code
>>> can then use the usual optab routines to query whether the permutation
>>> is possible and/or try to generate it.
>>
>> Thank you for elaborating. I had already prototyped the permute
>> index-computing code in my tree but was reluctant to use it during expand,
>> as I wanted the rotate RTX to be available for combining into XAR, so I felt
>> a bit stuck. Having the code in a generic place but called from the backend
>> at a time of its choosing makes sense to me.
>>
>>>
>>> I can see that it probably makes sense to leave target code to make
>>> the decision about when to use the permutation strategy vs. other
>>> approaches. But the code to implement that strategy shouldn't need
>>> to be target-specific.
>>>
>>> E.g. we could have a routine:
>>>
>>> expand_rotate_as_vec_perm
>>>
>>> which checks whether the rotation amount is suitable and tries to
>>> generate the permutation if so.
>>
>> I’ve implemented something like that in the attached patch.
>> It seems to work on AArch64 but as mentioned in the commit message I’d like
>> a check on the big-endian logic, and perhaps some pointers on how/whether it
>> should be extended to VLA vectors.
>
> Great! Thanks for doing this. Some comments on those aspects below,
> but otherwise it LGTM.
>
>>
>> I’m updating the other patches in the series according to your feedback, so
>> I’ll repost them once I’m done; I just wanted to get this out for further
>> iteration in the meantime.
>> Thanks,
>> Kyrill
>>
>> From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Wed, 16 Oct 2024 04:10:08 -0700
>> Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where possible
>>
>> Some vector rotate operations can be implemented in a single instruction
>> rather than using the fallback SHL+USRA sequence.
>> In particular, when the rotate amount is half the bitwidth of the element
>> we can use a REV64, REV32 or REV16 instruction.
>> More generally, rotates by a byte amount can be implemented using vector
>> permutes.
>> This patch adds such a generic routine in expmed.cc called
>> expand_rotate_as_vec_perm that calculates the required permute indices
>> and uses the expand_vec_perm_const interface.
>>
>> On aarch64 this ends up generating the single-instruction sequences above
>> where possible and can use LDR+TBL sequences too, which are a good choice.
>>
>> For now, I have restricted expand_rotate_as_vec_perm to fixed-width modes
>> as I don't have much experience with using it for VLA modes, but I imagine
>> it's extendable there. In any case, the only use of expand_rotate_as_vec_perm
>> is in aarch64-specific code that for now only handles fixed-width modes.
>>
>> A runtime aarch64 test is added to ensure the permute indices are not messed up.
>> I'd appreciate a review of the BYTES_BIG_ENDIAN logic. I've adjusted the
>> permute vector indices in RTL for it, and in the final AArch64 assembly the
>> vector loaded from .LC<N> is identical for little and big-endian, which I
>> *think* is the correct behaviour. For rotates by half the element width,
>> which should generate single REV64 or REV32 instructions, aarch64 does not
>> seem to recognise them and falls back to LDR+TBL for big-endian. I'm not
>> sure if that's simply missing logic in the vector permute expansion in
>> aarch64 or if I should be passing a different permute vector. If I delete
>> the BYTES_BIG_ENDIAN hunk in expand_rotate_as_vec_perm the permute indices
>> are reversed, but the REV* instructions are still not recognised.
>
> Hmm, yeah, sounds like it might be missing logic, like you say.
>
>> I don't have an easily accessible aarch64_be setup so I'd appreciate it if
>> someone could run it there; hopefully the new execute test can help uncover
>> any lane index bugs.
>
> Can't remember if the CI does aarch64_be or not.
>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> * expmed.h (expand_rotate_as_vec_perm): Declare.
>> * expmed.cc (expand_rotate_as_vec_perm): Define.
>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>> Declare prototype.
>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
>> Call the above.
>>
>> gcc/testsuite/
>>
>> * gcc.target/aarch64/vec-rot-exec.c: New test.
>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>> ---
>> gcc/config/aarch64/aarch64-protos.h | 1 +
>> gcc/config/aarch64/aarch64-simd.md | 3 +
>> gcc/config/aarch64/aarch64.cc | 16 +++
>> gcc/expmed.cc | 42 ++++++++
>> gcc/expmed.h | 1 +
>> .../gcc.target/aarch64/simd/pr117048_2.c | 66 ++++++++++++
>> .../gcc.target/aarch64/vec-rot-exec.c | 101 ++++++++++++++++++
>> 7 files changed, 230 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> index 75f30a52e61..fac7f9101a3 100644
>> --- a/gcc/config/aarch64/aarch64-protos.h
>> +++ b/gcc/config/aarch64/aarch64-protos.h
>> @@ -766,6 +766,7 @@ bool aarch64_rnd_imm_p (rtx);
>> bool aarch64_constant_address_p (rtx);
>> bool aarch64_emit_approx_div (rtx, rtx, rtx);
>> bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>> +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
>> tree aarch64_vector_load_decl (tree);
>> rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
>> void aarch64_expand_call (rtx, rtx, rtx, bool);
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index 6f69ccb6eba..b0aa6e0aded 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -1313,6 +1313,9 @@
>> (match_dup 4))
>> (match_dup 3)))]
>> {
>> + if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
>> + DONE;
>> +
>> operands[3] = reload_completed ? operands[0] : gen_reg_rtx (<MODE>mode);
>> rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
>> int bitwidth = GET_MODE_UNIT_SIZE (<MODE>mode) * BITS_PER_UNIT;
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 7fbe3a7380c..5b64978daba 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -16027,6 +16027,22 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
>> return true;
>> }
>>
>> +/* Emit an optimized sequence to perform a vector rotate
>> + of REG by the vector constant amount AMNT and place the result
>> + in DST. Return true iff successful. */
>> +
>> +bool
>> +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
>> +{
>> + machine_mode mode = GET_MODE (reg);
>> + /* Attempt to expand the rotate as a vector permute.
>> + For some rotate amounts they can be single instructions and
>> + even the general single-vector TBL permute has good throughput. */
>> + if (expand_rotate_as_vec_perm (mode, dst, reg, amnt))
>> + return true;
>> + return false;
>> +}
>> +
>> /* Return the number of instructions that can be issued per cycle. */
>> static int
>> aarch64_sched_issue_rate (void)
>> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> index 4498f9f38a8..1d7d6a7a914 100644
>> --- a/gcc/expmed.cc
>> +++ b/gcc/expmed.cc
>> @@ -6286,6 +6286,48 @@ emit_store_flag_force (rtx target, enum rtx_code code, rtx op0, rtx op1,
>> return target;
>> }
>>
>> +/* Expand a vector (left) rotate of MODE of X by an immediate AMT as a vector
>> + permute operation. Emit code to put the result in DST if successful and
>> + return it. Otherwise return NULL. This is intended to implement vector
>> + rotates by byte amounts using vector permutes when the target does not offer
>> + native vector rotate operations. */
>> +rtx
>> +expand_rotate_as_vec_perm (machine_mode mode, rtx dst, rtx x, rtx amt)
>> +{
>> + int rotamnt = INTVAL (unwrap_const_vec_duplicate (amt));
>
> It's probably worth making this even more general by checking for
> CONST_INT_P on the result of unwrap_const_vec_duplicate, to cope
> gracefully with variable amounts. (Or poly_int amounts :))
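Agreed, that's a nice generalisation. What I'm trying locally is roughly the
following (untested sketch; amt_cst is just a placeholder local):

  rtx amt_cst = unwrap_const_vec_duplicate (amt);
  /* Bail out gracefully on variable (or poly_int) rotate amounts.  */
  if (!CONST_INT_P (amt_cst))
    return NULL_RTX;
  int rotamnt = INTVAL (amt_cst);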
>
>> + if (rotamnt % BITS_PER_UNIT != 0)
>> + return NULL_RTX;
>> + machine_mode qimode;
>> + if (!qimode_for_vec_perm (mode).exists (&qimode))
>> + return NULL_RTX;
>> +
>> + vec_perm_builder builder;
>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>
> simpler as GET_MODE_UNIT_SIZE
>
>> + unsigned total_units;
>> + /* TODO: Handle VLA vector rotates? */
>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>> + return NULL_RTX;
>
> Yeah. I think we can do that by changing:
>
>> + builder.new_vector (total_units, 1, total_units);
>
> to:
>
> builder.new_vector (total_units, 3, units);
I think units here is the size in units (bytes) of the fixed-width component of
the mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VNx4HI?
What is the recommended API for getting that number out of the poly_uint64 mode
size? Is it just accessing coeffs[0]?
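I.e. is it just something like this (a placeholder sketch, assuming coeffs[0]
is meant to be accessed directly):

  /* Fixed-width component, in bytes, of the possibly variable-length size.  */
  poly_uint64 total_size = GET_MODE_SIZE (mode);
  unsigned units = total_size.coeffs[0];  /* 16 for V4SI/VNx4SI, 8 for V4HI.  */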
>
> unconditionally and making the outer loop below iterate exactly
> three times (i.e. to nunits * 3). It's ok if we generate more
> indices than needed.
>
>> + int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>> + for (unsigned j = 0; j < total_units; j += nunits)
>> + for (unsigned i = 0; i < nunits; i++)
>> + {
>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>> + if (BYTES_BIG_ENDIAN)
>> + idx = total_units - idx - 1;
>
> I think the endian adjustment should be local to the inner loop/vector
> element only. Since this would mean undoing the "nunits - " adjustment
> above, how about something like:
>
> unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
> unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
> ...
> builder.quick_push ((rot_to_perm + i) % nunits + j);
>
> or whatever variation you prefer.
>
> Hope I've got that right...
Hmm, I’m getting some test failures and wrong indices when I try this.
I think I can work out the indices and the loops for them but I’d like to work
through an example. So say we are rotating a V4SImode vector by 16 (a REV32
instruction).
The indices pushed into the byte permute vector with my original patch are:
[2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
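For reference, those come from the little-endian path of my original loop:
rotamnt = 16 and nunits = 4 give rot_to_perm = 4 - 16/8 = 2, so it pushes
(2 + i) % 4 + j for each element, i.e. roughly:

  for (unsigned j = 0; j < 16; j += 4)       /* one group per SI element */
    for (unsigned i = 0; i < 4; i++)
      builder.quick_push ((2 + i) % 4 + j);  /* 2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d */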
What sequence do we want to push for V4SImode now that we have 3 patterns in
vector_builder?
Is it repeating the above 3 times or is it interleaving each SImode entry
somehow?
Thanks,
Kyrill
>
> OK with those changes for the target-independent parts (and the
> target-specific parts LGTM too FWIW).
>
> Thanks,
> Richard
>
>> + builder.quick_push (idx);
>> + }
>> + rtx perm_src = lowpart_subreg (qimode, x, mode);
>> + rtx perm_dst = lowpart_subreg (qimode, dst, mode);
>> + rtx res
>> + = expand_vec_perm_const (qimode, perm_src, perm_src, builder,
>> + qimode, perm_dst);
>> + if (!res)
>> + return NULL_RTX;
>> + emit_move_insn (dst, lowpart_subreg (mode, res, qimode));
>> + return dst;
>> +}
>> +
>> /* Helper function for canonicalize_cmp_for_target. Swap between inclusive
>> and exclusive ranges in order to create an equivalent comparison. See
>> canonicalize_cmp_for_target for the possible cases. */
>> diff --git a/gcc/expmed.h b/gcc/expmed.h
>> index 0a19176b77a..2414c3cb27d 100644
>> --- a/gcc/expmed.h
>> +++ b/gcc/expmed.h
>> @@ -726,5 +726,6 @@ extern rtx expand_mult_highpart_adjust (scalar_int_mode,
>> rtx, rtx, rtx,
>> rtx, int);
>> extern rtx expmed_mult_highpart_optab (scalar_int_mode, rtx, rtx, rtx,
>> int, int);
>> +extern rtx expand_rotate_as_vec_perm (machine_mode, rtx, rtx, rtx);
>>
>> #endif // EXPMED_H
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> new file mode 100644
>> index 00000000000..7baf3581870
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>> @@ -0,0 +1,66 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mlittle-endian" } */
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +
>> +/*
>> +** G1:
>> +** rev64 v0\.4s, v0\.4s
>> +** ret
>> +*/
>> +v2di
>> +G1 (v2di r)
>> +{
>> + return (r >> 32) | (r << 32);
>> +}
>> +
>> +/*
>> +** G2:
>> +** rev32 v0\.8h, v0\.8h
>> +** ret
>> +*/
>> +v4si
>> +G2 (v4si r)
>> +{
>> + return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G3:
>> +** rev16 v0\.16b, v0\.16b
>> +** ret
>> +*/
>> +v8hi
>> +G3 (v8hi r)
>> +{
>> + return (r >> 8) | (r << 8);
>> +}
>> +
>> +/*
>> +** G4:
>> +** rev32 v0\.4h, v0\.4h
>> +** ret
>> +*/
>> +v2si
>> +G4 (v2si r)
>> +{
>> + return (r >> 16) | (r << 16);
>> +}
>> +
>> +/*
>> +** G5:
>> +** rev16 v0\.8b, v0\.8b
>> +** ret
>> +*/
>> +v4hi
>> +G5 (v4hi r)
>> +{
>> + return (r >> 8) | (r << 8);
>> +}
>> +
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> new file mode 100644
>> index 00000000000..130a9b1aa64
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vec-rot-exec.c
>> @@ -0,0 +1,101 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O2" } */
>> +
>> +typedef char __attribute__ ((vector_size (16))) v16qi;
>> +typedef unsigned short __attribute__ ((vector_size (16))) v8hi;
>> +typedef unsigned int __attribute__ ((vector_size (16))) v4si;
>> +typedef unsigned long long __attribute__ ((vector_size (16))) v2di;
>> +typedef char __attribute__ ((vector_size (8))) v8qi;
>> +typedef unsigned short __attribute__ ((vector_size (8))) v4hi;
>> +typedef unsigned int __attribute__ ((vector_size (8))) v2si;
>> +#define VEC_ELTS(X) (sizeof (X) / (sizeof (X[0])))
>> +
>> +static const char __attribute__ ((aligned (16))) *str = "abcdefghijklmnopqrstuvwxyz";
>> +
>> +unsigned long long
>> +__attribute__((noipa,noinline))
>> +rot_64_one (unsigned long long x, unsigned amt)
>> +{
>> + return (x << amt) | (x >> (64 - amt));
>> +}
>> +unsigned int
>> +__attribute__((noipa,noinline))
>> +rot_32_one (unsigned int x, unsigned amt)
>> +{
>> + return (x << amt) | (x >> (32 - amt));
>> +}
>> +
>> +unsigned short
>> +__attribute__((noipa,noinline))
>> +rot_16_one (unsigned short x, unsigned short amt)
>> +{
>> + return (x << amt) | (x >> (16 - amt));
>> +}
>> +
>> +
>> +#define ROTFUNC(M,S,A) \
>> +M \
>> +__attribute__((noipa,noinline)) \
>> +rot_##M##_##S##_##A (M x) \
>> +{ \
>> + return (x << A) | (x >> (S - A)); \
>> +} \
>> + \
>> +void \
>> +test_rot_##M##_##S##_##A (void) \
>> +{ \
>> + M vec = *(M *)str; \
>> + M res = rot_##M##_##S##_##A (vec); \
>> + for (__SIZE_TYPE__ i = 0; i < VEC_ELTS (vec); i++) \
>> + if (res[i] != rot_##S##_one (vec[i], A)) \
>> + __builtin_abort (); \
>> +}
>> +
>> +ROTFUNC (v2di, 64, 56)
>> +ROTFUNC (v2di, 64, 48)
>> +ROTFUNC (v2di, 64, 40)
>> +ROTFUNC (v2di, 64, 32)
>> +ROTFUNC (v2di, 64, 24)
>> +ROTFUNC (v2di, 64, 16)
>> +ROTFUNC (v2di, 64, 8)
>> +
>> +ROTFUNC (v4si, 32, 24)
>> +ROTFUNC (v4si, 32, 16)
>> +ROTFUNC (v4si, 32, 8)
>> +
>> +ROTFUNC (v8hi, 16, 8)
>> +
>> +ROTFUNC (v2si, 32, 24)
>> +ROTFUNC (v2si, 32, 16)
>> +ROTFUNC (v2si, 32, 8)
>> +
>> +ROTFUNC (v4hi, 16, 8)
>> +
>> +#define CALL_TEST(M,S,A) test_rot_##M##_##S##_##A ()
>> +
>> +int
>> +main (void)
>> +{
>> + CALL_TEST (v2di, 64, 56);
>> + CALL_TEST (v2di, 64, 48);
>> + CALL_TEST (v2di, 64, 40);
>> + CALL_TEST (v2di, 64, 32);
>> + CALL_TEST (v2di, 64, 24);
>> + CALL_TEST (v2di, 64, 16);
>> + CALL_TEST (v2di, 64, 8);
>> +
>> + CALL_TEST (v4si, 32, 24);
>> + CALL_TEST (v4si, 32, 16);
>> + CALL_TEST (v4si, 32, 8);
>> +
>> + CALL_TEST (v8hi, 16, 8);
>> +
>> + CALL_TEST (v2si, 32, 24);
>> + CALL_TEST (v2si, 32, 16);
>> + CALL_TEST (v2si, 32, 8);
>> +
>> + CALL_TEST (v4hi, 16, 8);
>> +
>> + return 0;
>> +}
>> +