Hi all,

The intent of this patch is similar to the previous patches in the series:
Make more use of BSL2N when we have DImode operands in SIMD regs,
but still use the GP instructions when that's where the operands are.
Compared to the previous patches there are a couple of complications:
* The operands form a more complex RTX that gets rejected by the RTX cost
checks during combine. This is fixed by adding some costing logic to
aarch64_rtx_costs.

* The GP split sequence requires two temporaries instead of just one.
I've marked operand 1 as an input/output operand so that it provides
the second temporary alongside the earlyclobber operand 0. This means
that operand 1 is marked with "+" even for the "w" alternatives, as the
modifier is global, but I don't see another way out here. Suggestions welcome.
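For illustration, the constraint arrangement described above could look
roughly like the truncated sketch below. This is a hypothetical
reconstruction, not the pattern from the patch: the RTL shape, the
predicates and the alternatives are assumptions, and the output template
and split body are omitted.

```lisp
;; Hypothetical sketch only: operand 0 is earlyclobbered in the GP
;; alternative ("&r") and operand 1 is tied as input/output ("+"), giving
;; the GP split sequence a second temporary to work with.  The "+"
;; modifier applies globally, i.e. also to the "w" alternative.
(define_insn_and_split "*aarch64_sve2_bsl2n_unpreddi"
  [(set (match_operand:DI 0 "register_operand" "=w,&r")
	(ior:DI
	  (and:DI (match_operand:DI 1 "register_operand" "+w,r")
		  (match_operand:DI 2 "register_operand" "w,r"))
	  (and:DI (not:DI (match_operand:DI 3 "register_operand" "w,r"))
		  (not:DI (match_dup 2)))))]
  "TARGET_SVE2"
  ;; Output template, split condition and split sequence omitted here.
  "#")
```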

With these fixed, for the testcase we now generate:
bsl2n_gp: // unchanged scalar output
orr x1, x2, x1
and x0, x0, x2
orn x0, x0, x1
ret

bsl2n_d:
bsl2n z0.d, z0.d, z1.d, z2.d
ret

compared to the previously generated code:
bsl2n_gp:
orr x1, x2, x1
and x0, x0, x2
orn x0, x0, x1
ret

bsl2n_d:
orr v1.8b, v2.8b, v1.8b
and v0.8b, v2.8b, v0.8b
orn v0.8b, v0.8b, v1.8b
ret

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for trunk?
Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

        * config/aarch64/aarch64-sve2.md (*aarch64_sve2_bsl2n_unpreddi): New
        define_insn_and_split.
        * config/aarch64/aarch64.cc (aarch64_bsl2n_rtx_form_p): Define.
        (aarch64_rtx_costs): Use the above. Cost BSL2N ops.

gcc/testsuite/

        * gcc.target/aarch64/sve2/bsl2n_d.c: New test.

Attachment: 0007-aarch64-Use-BSL2N-for-DImode-operands.patch
