https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258
kugan at gcc dot gnu.org changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |kugan at gcc dot gnu.org

--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> <rsand...@gcc.gnu.org>:
>
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
>
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford <richard.sandif...@arm.com>
> Date:   Wed May 29 16:43:33 2024 +0100
>
>     aarch64: Split aarch64_combinev16qi before RA [PR115258]
>
>     Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
>     purpose is to put the two input data vectors into consecutive registers.
>     This aarch64_combinev16qi was then split after reload into individual
>     moves (from the first input to the first half of the output, and from
>     the second input to the second half of the output).
>
>     In the worst case, the RA might allocate things so that the destination
>     of the aarch64_combinev16qi is the second input followed by the first
>     input.  In that case, the split form of aarch64_combinev16qi uses three
>     eors to swap the registers around.
>
>     This PR is about a test where this worst case occurred.  And given the
>     insn description, that allocation doesn't seem unreasonable.
>
>     early-ra should (hopefully) mean that we're now better at allocating
>     subregs of vector registers.  The upcoming RA subreg patches should
>     improve things further.  The best fix for the PR therefore seems
>     to be to split the combination before RA, so that the RA can see
>     the underlying moves.
>
>     Perhaps it even makes sense to do this at expand time, avoiding the need
>     for aarch64_combinev16qi entirely.  That deserves more experimentation
>     though.
>
>     gcc/
>         PR target/115258
>         * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
>         the split before reload.
>         * config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
>         into a form that handles pseudo registers.
>
>     gcc/testsuite/
>         PR target/115258
>         * gcc.target/aarch64/pr115258.c: New test.

This is causing a performance regression in some TSVC kernels and others.
Here is an example: https://godbolt.org/z/r91nYEEsP

We now get:

.L3:
        add     x3, x26, x0
        add     x2, x25, x0
        add     x3, x3, 65536
        add     x2, x2, 65536
        sub     x0, x0, #16
        ldr     q31, [x3, 62448]
        mov     v28.16b, v31.16b
        mov     v29.16b, v31.16b
        tbl     v31.16b, {v28.16b - v29.16b}, v30.16b
        fadd    v31.4s, v31.4s, v25.4s
        mov     v26.16b, v31.16b
        mov     v27.16b, v31.16b
        tbl     v31.16b, {v26.16b - v27.16b}, v30.16b
        str     q31, [x2, 62448]
        cmp     x0, x27
        bne     .L3
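For readers less familiar with the instruction involved: a two-vector TBL treats its two source registers as one 32-byte table and uses each byte of the index vector to pick a result byte, with out-of-range indices producing zero. That is why the sources must sit in consecutive registers (the constraint aarch64_combinev16qi expresses), and why the redundant mov pairs above are pure overhead when both table halves are the same value. A portable scalar sketch of the TBL semantics (function and variable names are mine, not GCC's or Arm's):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of AArch64 "tbl vd.16b, {vn.16b, vn+1.16b}, vm.16b":
   lo and hi form a 32-byte table; each byte of idx selects one table
   byte, and indices >= 32 yield zero.  */
static void tbl2_model(uint8_t dst[16], const uint8_t lo[16],
                       const uint8_t hi[16], const uint8_t idx[16])
{
    uint8_t table[32];
    memcpy(table, lo, 16);       /* first register: table bytes 0..15  */
    memcpy(table + 16, hi, 16);  /* second register: table bytes 16..31 */
    for (int i = 0; i < 16; i++)
        dst[i] = idx[i] < 32 ? table[idx[i]] : 0;
}
```

In the regressed loop both halves of the table are copies of the same loaded vector, so an ideal allocation would place that value directly into a consecutive register pair instead of emitting two movs before each tbl.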