Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations

Richard Sandiford Tue, 08 Jul 2025 02:07:26 -0700

Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Monday, July 7, 2025 12:55 PM
>> To: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Earnshaw
>> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
>> Pinski <pins...@gmail.com>
>> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>> 
>> Richard Sandiford <richard.sandif...@arm.com> writes:
>> > Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> >> Hi all,
>> >>
>> >> To handle DImode BCAX operations we want to do them on the SIMD side only
>> if
>> >> the incoming arguments don't require a cross-bank move.
>> >> This means we need to split back the combination to separate GP BIC+EOR
>> >> instructions if the operands are expected to be in GP regs through reload.
>> >> The split happens pre-reload if we already know that the destination will 
>> >> be
>> >> a GP reg. Otherwise if reload descides to use the "=r,r" alternative we 
>> >> ensure
>> >> operand 0 is early-clobber.
>> >> This scheme is similar to how we handle the BSL operations elsewhere in
>> >> aarch64-simd.md.
>> >>
>> >> Thus, for the functions:
>> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, 
>> >> b, c); }
>> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return 
>> >> BCAX (a,
>> b, c); }
>> >>
>> >> we now generate the desired:
>> >> bcax_d_gp:
>> >> bic x1, x1, x2
>> >> eor x0, x1, x0
>> >> ret
>> >>
>> >> bcax_d:
>> >> bcax v0.16b, v0.16b, v1.16b, v2.16b
>> >> ret
>> >>
>> >> When the inputs are in SIMD regs we use BCAX and when they are in GP regs 
>> >> we
>> >> don't force them to SIMD with extra moves.
>> >>
>> >> Bootstrapped and tested on aarch64-none-linux-gnu.
>> >> Ok for trunk?
>> >> Thanks,
>> >> Kyrill
>> >>
>> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> >>
>> >> gcc/
>> >>
>> >>   * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>> >>   define_insn_and_split.
>> >>
>> >> gcc/testsuite/
>> >>
>> >>   * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>> >>
>> >> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 00:00:00
>> 2001
>> >> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> >> Date: Thu, 3 Jul 2025 09:45:02 -0700
>> >> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>> >>
>> >> To handle DImode BCAX operations we want to do them on the SIMD side only
>> if
>> >> the incoming arguments don't require a cross-bank move.
>> >> This means we need to split back the combination to separate GP BIC+EOR
>> >> instructions if the operands are expected to be in GP regs through reload.
>> >> The split happens pre-reload if we already know that the destination will 
>> >> be
>> >> a GP reg.  Otherwise if reload descides to use the "=r,r" alternative we 
>> >> ensure
>> >> operand 0 is early-clobber.
>> >> This scheme is similar to how we handle the BSL operations elsewhere in
>> >> aarch64-simd.md.
>> >>
>> >> Thus, for the functions:
>> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, 
>> >> b, c); }
>> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return 
>> >> BCAX (a,
>> b, c); }
>> >>
>> >> we now generate the desired:
>> >> bcax_d_gp:
>> >>         bic     x1, x1, x2
>> >>         eor     x0, x1, x0
>> >>         ret
>> >>
>> >> bcax_d:
>> >>         bcax    v0.16b, v0.16b, v1.16b, v2.16b
>> >>         ret
>> >>
>> >> When the inputs are in SIMD regs we use BCAX and when they are in GP regs 
>> >> we
>> >> don't force them to SIMD with extra moves.
>> >>
>> >> Bootstrapped and tested on aarch64-none-linux-gnu.
>> >>
>> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> >>
>> >> gcc/
>> >>
>> >>   * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>> >>   define_insn_and_split.
>> >>
>> >> gcc/testsuite/
>> >>
>> >>   * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>> >> ---
>> >>  gcc/config/aarch64/aarch64-simd.md            | 29 +++++++++++++++++++
>> >>  .../gcc.target/aarch64/simd/bcax_d.c          |  6 +++-
>> >>  2 files changed, 34 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/gcc/config/aarch64/aarch64-simd.md
>> b/gcc/config/aarch64/aarch64-simd.md
>> >> index 4493e55603d..be6a16b4be8 100644
>> >> --- a/gcc/config/aarch64/aarch64-simd.md
>> >> +++ b/gcc/config/aarch64/aarch64-simd.md
>> >> @@ -9252,6 +9252,35 @@
>> >>    [(set_attr "type" "crypto_sha3")]
>> >>  )
>> >>
>> >> +(define_insn_and_split "*bcaxqdi4"
>> >> +  [(set (match_operand:DI 0 "register_operand" "=w,&r")
>> >> + (xor:DI
>> >> +   (and:DI
>> >> +     (not:DI (match_operand:DI 3 "register_operand" "w,r"))
>> >> +     (match_operand:DI 2 "register_operand" "w,r"))
>> >> +   (match_operand:DI 1 "register_operand" "w,r")))]
>> >
>> > I think the constraint on operand 1 should be "w,r0", so that we allow
>> > operand 1 to be the same as operand 0.  Without that, and with split1
>> > disabled/sidelined, we would end up with an extra move for:
>> >
>> >   uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) {
>> >     return x0 ^ (x1 & ~x2);
>> >   }
>> >
>> > (The only reason split1 avoids the extra move is that combine combines
>> > the hard register copy into the *bcaxqdi4, which is a bit dubious from
>> > an RA perspective.)
>> 
>> Sigh.  Wrong way round, of course: it's operands 2 and 3 that can be "w,r0".
>> 
>
> Question for my own understanding. From an RA perspective can the tie end up
> with the same cost as the r? I was wondering whether w,0r or w,r0 makes a 
> difference.


Hmm, good question.  It turns out that the costings for "0r" and "r0"
are different: the costing for "0r" sums both the "0" costs and the "r"
costs, whereas the costing for "r0" in the same as for "r".
(See ira-costs.cc:record_reg_classes, where '0'-'9' are handled by
an "if" statement and the other constraints are handled by a following
"while" statement.)

"0" is costed based on the register class of operand 0, so effectively
in the same way as "r".  So I think the effect of costing both "0" and
"r" in "0r" would be double-counting.

However, the costs of allocating a GPR to "r" (or to a "0" bound to "=r")
are 0, so if the alternative is considered in isolation, the double
counting would increase the cost of non-GPR classes while not increasing
the cost of GPR classes.

When an instruction has multiple alternatives, the final cost for each
class is the minimum cost for that class over all alternatives.  So if
the "=r"/"0r" alternative is alongside a "=w"/"w" alternative, the FPR
costs would be taken purely from the "=w"/"w" alternative, and any
double counting in the other alternative would have no effect.
And I don't think there are any allocatable classes outside "r"
and "w" that can store integers.

Still, I think that does mean that "r0" suggested above would be better
than "0r".  It should give the same class costs as just "r".

Thanks,
Richard

Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations

Reply via email to