https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
kugan at gcc dot gnu.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org
--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> <[email protected]>:
>
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
>
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford <[email protected]>
> Date: Wed May 29 16:43:33 2024 +0100
>
> aarch64: Split aarch64_combinev16qi before RA [PR115258]
>
> Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
> purpose is to put the two input data vectors into consecutive registers.
> This aarch64_combinev16qi was then split after reload into individual
> moves (from the first input to the first half of the output, and from
> the second input to the second half of the output).
>
> In the worst case, the RA might allocate things so that the destination
> of the aarch64_combinev16qi is the second input followed by the first
> input. In that case, the split form of aarch64_combinev16qi uses three
> eors to swap the registers around.
>
> This PR is about a test where this worst case occurred. And given the
> insn description, that allocation doesn't seem unreasonable.
>
> early-ra should (hopefully) mean that we're now better at allocating
> subregs of vector registers. The upcoming RA subreg patches should
> improve things further. The best fix for the PR therefore seems
> to be to split the combination before RA, so that the RA can see
> the underlying moves.
>
> Perhaps it even makes sense to do this at expand time, avoiding the need
> for aarch64_combinev16qi entirely. That deserves more experimentation
> though.
>
> gcc/
> PR target/115258
> * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
> the split before reload.
> * config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
> into a form that handles pseudo registers.
>
> gcc/testsuite/
> PR target/115258
> * gcc.target/aarch64/pr115258.c: New test.
This is causing a performance regression in some TSVC kernels and elsewhere.
Here is an example:
https://godbolt.org/z/r91nYEEsP
We now get:
.L3:
add x3, x26, x0
add x2, x25, x0
add x3, x3, 65536
add x2, x2, 65536
sub x0, x0, #16
ldr q31, [x3, 62448]
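// the next two movs copy v31 into the consecutive pair {v28, v29} solely to feed the two-register tbl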
mov v28.16b, v31.16b
mov v29.16b, v31.16b
tbl v31.16b, {v28.16b - v29.16b}, v30.16b
fadd v31.4s, v31.4s, v25.4s
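// likewise, v31 is copied into the consecutive pair {v26, v27} to form the table for the second tbl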
mov v26.16b, v31.16b
mov v27.16b, v31.16b
tbl v31.16b, {v26.16b - v27.16b}, v30.16b
str q31, [x2, 62448]
cmp x0, x27
bne .L3
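For reference, a minimal sketch of the pattern involved (hypothetical code, not the TSVC source; the function name and use of ACLE intrinsics are only meant to illustrate the two-register TBL whose table GCC builds with aarch64_combinev16qi):

#include <arm_neon.h>

/* Illustration only: a permute whose table is a pair of vectors.  The
   {lo, hi} pair roughly corresponds to aarch64_combinev16qi and has to
   end up in consecutive V registers before the two-register tbl can
   consume it.  */
uint8x16_t permute (uint8x16_t lo, uint8x16_t hi, uint8x16_t idx)
{
  uint8x16x2_t table = { { lo, hi } };
  return vqtbl2q_u8 (table, idx);
}

In the loop above both halves of the pair come from the same vector (v31), so each tbl is preceded by two movs whose only job is to build that consecutive pair.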