Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations

Kyrylo Tkachov Tue, 08 Jul 2025 06:36:57 -0700


> On 8 Jul 2025, at 12:39, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, July 8, 2025 10:07 AM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Kyrylo Tkachov <ktkac...@nvidia.com>; GCC Patches <gcc-
>> patc...@gcc.gnu.org>; Richard Earnshaw <richard.earns...@arm.com>; Alex
>> Coplan <alex.cop...@arm.com>; Andrew Pinski <pins...@gmail.com>
>> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>> 
>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>> -----Original Message-----
>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>> Sent: Monday, July 7, 2025 12:55 PM
>>>> To: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Earnshaw
>>>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>>>> Pinski <pins...@gmail.com>
>>>> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>>>> 
>>>> Richard Sandiford <richard.sandif...@arm.com> writes:
>>>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>>>> Hi all,
>>>>>> 
>>>>>> To handle DImode BCAX operations we want to do them on the SIMD side only if
>>>>>> the incoming arguments don't require a cross-bank move.
>>>>>> This means we need to split back the combination to separate GP BIC+EOR
>>>>>> instructions if the operands are expected to be in GP regs through reload.
>>>>>> The split happens pre-reload if we already know that the destination will be
>>>>>> a GP reg. Otherwise if reload decides to use the "=r,r" alternative we ensure
>>>>>> operand 0 is early-clobber.
>>>>>> This scheme is similar to how we handle the BSL operations elsewhere in
>>>>>> aarch64-simd.md.
>>>>>> 
>>>>>> Thus, for the functions:
>>>>>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>>>>>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>>>>> 
>>>>>> we now generate the desired:
>>>>>> bcax_d_gp:
>>>>>>        bic     x1, x1, x2
>>>>>>        eor     x0, x1, x0
>>>>>>        ret
>>>>>> 
>>>>>> bcax_d:
>>>>>>        bcax    v0.16b, v0.16b, v1.16b, v2.16b
>>>>>>        ret
>>>>>> 
>>>>>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs we
>>>>>> don't force them to SIMD with extra moves.
>>>>>> 
>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>> Ok for trunk?
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>> 
>>>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>>> 
>>>>>> gcc/
>>>>>> 
>>>>>> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>>>>>> define_insn_and_split.
>>>>>> 
>>>>>> gcc/testsuite/
>>>>>> 
>>>>>> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>>>>>> 
>>>>>> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 00:00:00 2001
>>>>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>>> Date: Thu, 3 Jul 2025 09:45:02 -0700
>>>>>> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>>>>>> 
>>>>>> To handle DImode BCAX operations we want to do them on the SIMD side only if
>>>>>> the incoming arguments don't require a cross-bank move.
>>>>>> This means we need to split back the combination to separate GP BIC+EOR
>>>>>> instructions if the operands are expected to be in GP regs through reload.
>>>>>> The split happens pre-reload if we already know that the destination will be
>>>>>> a GP reg.  Otherwise if reload decides to use the "=r,r" alternative we ensure
>>>>>> operand 0 is early-clobber.
>>>>>> This scheme is similar to how we handle the BSL operations elsewhere in
>>>>>> aarch64-simd.md.
>>>>>> 
>>>>>> Thus, for the functions:
>>>>>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>>>>>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>>>>> 
>>>>>> we now generate the desired:
>>>>>> bcax_d_gp:
>>>>>>        bic     x1, x1, x2
>>>>>>        eor     x0, x1, x0
>>>>>>        ret
>>>>>> 
>>>>>> bcax_d:
>>>>>>        bcax    v0.16b, v0.16b, v1.16b, v2.16b
>>>>>>        ret
>>>>>> 
>>>>>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs we
>>>>>> don't force them to SIMD with extra moves.
>>>>>> 
>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>> 
>>>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>>> 
>>>>>> gcc/
>>>>>> 
>>>>>> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>>>>>> define_insn_and_split.
>>>>>> 
>>>>>> gcc/testsuite/
>>>>>> 
>>>>>> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>>>>>> ---
>>>>>> gcc/config/aarch64/aarch64-simd.md            | 29 +++++++++++++++++++
>>>>>> .../gcc.target/aarch64/simd/bcax_d.c          |  6 +++-
>>>>>> 2 files changed, 34 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>>>>> index 4493e55603d..be6a16b4be8 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>>>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>>>>> @@ -9252,6 +9252,35 @@
>>>>>>   [(set_attr "type" "crypto_sha3")]
>>>>>> )
>>>>>> 
>>>>>> +(define_insn_and_split "*bcaxqdi4"
>>>>>> +  [(set (match_operand:DI 0 "register_operand" "=w,&r")
>>>>>> + (xor:DI
>>>>>> +   (and:DI
>>>>>> +     (not:DI (match_operand:DI 3 "register_operand" "w,r"))
>>>>>> +     (match_operand:DI 2 "register_operand" "w,r"))
>>>>>> +   (match_operand:DI 1 "register_operand" "w,r")))]
>>>>> 
>>>>> I think the constraint on operand 1 should be "w,r0", so that we allow
>>>>> operand 1 to be the same as operand 0.  Without that, and with split1
>>>>> disabled/sidelined, we would end up with an extra move for:
>>>>> 
>>>>>  uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) {
>>>>>    return x0 ^ (x1 & ~x2);
>>>>>  }
>>>>> 
>>>>> (The only reason split1 avoids the extra move is that combine combines
>>>>> the hard register copy into the *bcaxqdi4, which is a bit dubious from
>>>>> an RA perspective.)
>>>> 
>>>> Sigh.  Wrong way round, of course: it's operands 2 and 3 that can be "w,r0".
>>>> 
>>> 
>>> Question for my own understanding. From an RA perspective can the tie end up
>>> with the same cost as the r? I was wondering whether w,0r or w,r0 makes a
>>> difference.
>> 
>> Hmm, good question.  It turns out that the costings for "0r" and "r0"
>> are different: the costing for "0r" sums both the "0" costs and the "r"
>> costs, whereas the costing for "r0" is the same as for "r".
>> (See ira-costs.cc:record_reg_classes, where '0'-'9' are handled by
>> an "if" statement and the other constraints are handled by a following
>> "while" statement.)
>> 
>> "0" is costed based on the register class of operand 0, so effectively
>> in the same way as "r".  So I think the effect of costing both "0" and
>> "r" in "0r" would be double-counting.
>> 
>> However, the costs of allocating a GPR to "r" (or to a "0" bound to "=r")
>> are 0, so if the alternative is considered in isolation, the double
>> counting would increase the cost of non-GPR classes while not increasing
>> the cost of GPR classes.
>> 
>> When an instruction has multiple alternatives, the final cost for each
>> class is the minimum cost for that class over all alternatives.  So if
>> the "=r"/"0r" alternative is alongside a "=w"/"w" alternative, the FPR
>> costs would be taken purely from the "=w"/"w" alternative, and any
>> double counting in the other alternative would have no effect.
>> And I don't think there are any allocatable classes outside "r"
>> and "w" that can store integers.
>> 
>> Still, I think that does mean that "r0" suggested above would be better
>> than "0r".  It should give the same class costs as just "r".
>> 
> 
> That's a surprising difference! Thanks for explaining.  Yeah I agree that r0
> is better.


Thanks for your comments. Do you mean something like the attached v2 patch?
Or do you mean to have separate alternatives with each one individually tying 
one of operands 2 or 3 to r0?

Kyrill


> 
> Thanks,
> Tamar
> 
>> Thanks,
>> Richard

v2-0003-aarch64-Handle-DImode-BCAX-operations.patch
