> On 8 Jul 2025, at 12:39, Tamar Christina <tamar.christ...@arm.com> wrote: > >> -----Original Message----- >> From: Richard Sandiford <richard.sandif...@arm.com> >> Sent: Tuesday, July 8, 2025 10:07 AM >> To: Tamar Christina <tamar.christ...@arm.com> >> Cc: Kyrylo Tkachov <ktkac...@nvidia.com>; GCC Patches <gcc- >> patc...@gcc.gnu.org>; Richard Earnshaw <richard.earns...@arm.com>; Alex >> Coplan <alex.cop...@arm.com>; Andrew Pinski <pins...@gmail.com> >> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations >> >> Tamar Christina <tamar.christ...@arm.com> writes: >>>> -----Original Message----- >>>> From: Richard Sandiford <richard.sandif...@arm.com> >>>> Sent: Monday, July 7, 2025 12:55 PM >>>> To: Kyrylo Tkachov <ktkac...@nvidia.com> >>>> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Earnshaw >>>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew >>>> Pinski <pins...@gmail.com> >>>> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations >>>> >>>> Richard Sandiford <richard.sandif...@arm.com> writes: >>>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>>>>> Hi all, >>>>>> >>>>>> To handle DImode BCAX operations we want to do them on the SIMD side >> only >>>> if >>>>>> the incoming arguments don't require a cross-bank move. >>>>>> This means we need to split back the combination to separate GP BIC+EOR >>>>>> instructions if the operands are expected to be in GP regs through >>>>>> reload. >>>>>> The split happens pre-reload if we already know that the destination >>>>>> will be >>>>>> a GP reg. Otherwise if reload descides to use the "=r,r" alternative we >>>>>> ensure >>>>>> operand 0 is early-clobber. >>>>>> This scheme is similar to how we handle the BSL operations elsewhere in >>>>>> aarch64-simd.md. >>>>>> >>>>>> Thus, for the functions: >>>>>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX >>>>>> (a, b, >> c); } >>>>>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return >>>>>> BCAX >> (a, >>>> b, c); } >>>>>> >>>>>> we now generate the desired: >>>>>> bcax_d_gp: >>>>>> bic x1, x1, x2 >>>>>> eor x0, x1, x0 >>>>>> ret >>>>>> >>>>>> bcax_d: >>>>>> bcax v0.16b, v0.16b, v1.16b, v2.16b >>>>>> ret >>>>>> >>>>>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs >> we >>>>>> don't force them to SIMD with extra moves. >>>>>> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>>>> Ok for trunk? >>>>>> Thanks, >>>>>> Kyrill >>>>>> >>>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>>>>> >>>>>> gcc/ >>>>>> >>>>>> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New >>>>>> define_insn_and_split. >>>>>> >>>>>> gcc/testsuite/ >>>>>> >>>>>> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments. >>>>>> >>>>>> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 >> 00:00:00 >>>> 2001 >>>>>> From: Kyrylo Tkachov <ktkac...@nvidia.com> >>>>>> Date: Thu, 3 Jul 2025 09:45:02 -0700 >>>>>> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations >>>>>> >>>>>> To handle DImode BCAX operations we want to do them on the SIMD side >> only >>>> if >>>>>> the incoming arguments don't require a cross-bank move. >>>>>> This means we need to split back the combination to separate GP BIC+EOR >>>>>> instructions if the operands are expected to be in GP regs through >>>>>> reload. >>>>>> The split happens pre-reload if we already know that the destination >>>>>> will be >>>>>> a GP reg. Otherwise if reload descides to use the "=r,r" alternative we >>>>>> ensure >>>>>> operand 0 is early-clobber. >>>>>> This scheme is similar to how we handle the BSL operations elsewhere in >>>>>> aarch64-simd.md. >>>>>> >>>>>> Thus, for the functions: >>>>>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX >>>>>> (a, b, >> c); } >>>>>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return >>>>>> BCAX >> (a, >>>> b, c); } >>>>>> >>>>>> we now generate the desired: >>>>>> bcax_d_gp: >>>>>> bic x1, x1, x2 >>>>>> eor x0, x1, x0 >>>>>> ret >>>>>> >>>>>> bcax_d: >>>>>> bcax v0.16b, v0.16b, v1.16b, v2.16b >>>>>> ret >>>>>> >>>>>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs >> we >>>>>> don't force them to SIMD with extra moves. >>>>>> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>>>> >>>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>>>>> >>>>>> gcc/ >>>>>> >>>>>> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New >>>>>> define_insn_and_split. >>>>>> >>>>>> gcc/testsuite/ >>>>>> >>>>>> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments. >>>>>> --- >>>>>> gcc/config/aarch64/aarch64-simd.md | 29 +++++++++++++++++++ >>>>>> .../gcc.target/aarch64/simd/bcax_d.c | 6 +++- >>>>>> 2 files changed, 34 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md >>>> b/gcc/config/aarch64/aarch64-simd.md >>>>>> index 4493e55603d..be6a16b4be8 100644 >>>>>> --- a/gcc/config/aarch64/aarch64-simd.md >>>>>> +++ b/gcc/config/aarch64/aarch64-simd.md >>>>>> @@ -9252,6 +9252,35 @@ >>>>>> [(set_attr "type" "crypto_sha3")] >>>>>> ) >>>>>> >>>>>> +(define_insn_and_split "*bcaxqdi4" >>>>>> + [(set (match_operand:DI 0 "register_operand" "=w,&r") >>>>>> + (xor:DI >>>>>> + (and:DI >>>>>> + (not:DI (match_operand:DI 3 "register_operand" "w,r")) >>>>>> + (match_operand:DI 2 "register_operand" "w,r")) >>>>>> + (match_operand:DI 1 "register_operand" "w,r")))] >>>>> >>>>> I think the constraint on operand 1 should be "w,r0", so that we allow >>>>> operand 1 to be the same as operand 0. Without that, and with split1 >>>>> disabled/sidelined, we would end up with an extra move for: >>>>> >>>>> uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) { >>>>> return x0 ^ (x1 & ~x2); >>>>> } >>>>> >>>>> (The only reason split1 avoids the extra move is that combine combines >>>>> the hard register copy into the *bcaxqdi4, which is a bit dubious from >>>>> an RA perspective.) >>>> >>>> Sigh. Wrong way round, of course: it's operands 2 and 3 that can be >>>> "w,r0". >>>> >>> >>> Question for my own understanding. From an RA perspective can the tie end up >>> with the same cost as the r? I was wondering whether w,0r or w,r0 makes a >> difference. >> >> Hmm, good question. It turns out that the costings for "0r" and "r0" >> are different: the costing for "0r" sums both the "0" costs and the "r" >> costs, whereas the costing for "r0" in the same as for "r". >> (See ira-costs.cc:record_reg_classes, where '0'-'9' are handled by >> an "if" statement and the other constraints are handled by a following >> "while" statement.) >> >> "0" is costed based on the register class of operand 0, so effectively >> in the same way as "r". So I think the effect of costing both "0" and >> "r" in "0r" would be double-counting. >> >> However, the costs of allocating a GPR to "r" (or to a "0" bound to "=r") >> are 0, so if the alternative is considered in isolation, the double >> counting would increase the cost of non-GPR classes while not increasing >> the cost of GPR classes. >> >> When an instruction has multiple alternatives, the final cost for each >> class is the minimum cost for that class over all alternatives. So if >> the "=r"/"0r" alternative is alongside a "=w"/"w" alternative, the FPR >> costs would be taken purely from the "=w"/"w" alternative, and any >> double counting in the other alternative would have no effect. >> And I don't think there are any allocatable classes outside "r" >> and "w" that can store integers. >> >> Still, I think that does mean that "r0" suggested above would be better >> than "0r". It should give the same class costs as just "r". >> > > That's a surprising difference! Thanks for explaining. Yeah I agree that r0 > is better.
Thanks for your comments, do you mean something like the following? Or do you mean to have separate alternatives with each one individually tying one of operands 2 or 3 to r0? Kyrill > > Thanks, > Tamar > >> Thanks, >> Richard
v2-0003-aarch64-Handle-DImode-BCAX-operations.patch
Description: v2-0003-aarch64-Handle-DImode-BCAX-operations.patch