https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org

--- Comment #6 from kugan at gcc dot gnu.org ---
(In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> <rsand...@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> 
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford <richard.sandif...@arm.com>
> Date:   Wed May 29 16:43:33 2024 +0100
> 
>     aarch64: Split aarch64_combinev16qi before RA [PR115258]
>     
>     Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
>     purpose is to put the two input data vectors into consecutive registers.
>     This aarch64_combinev16qi was then split after reload into individual
>     moves (from the first input to the first half of the output, and from
>     the second input to the second half of the output).
>     
>     In the worst case, the RA might allocate things so that the destination
>     of the aarch64_combinev16qi is the second input followed by the first
>     input.  In that case, the split form of aarch64_combinev16qi uses three
>     eors to swap the registers around.
>     
>     This PR is about a test where this worst case occurred.  And given the
>     insn description, that allocation doesn't seem unreasonable.
>     
>     early-ra should (hopefully) mean that we're now better at allocating
>     subregs of vector registers.  The upcoming RA subreg patches should
>     improve things further.  The best fix for the PR therefore seems
>     to be to split the combination before RA, so that the RA can see
>     the underlying moves.
>     
>     Perhaps it even makes sense to do this at expand time, avoiding the need
>     for aarch64_combinev16qi entirely.  That deserves more experimentation
>     though.
>     
>     gcc/
>             PR target/115258
>             * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
>             the split before reload.
>             * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
>             Generalize into a form that handles pseudo registers.
>     
>     gcc/testsuite/
>             PR target/115258
>             * gcc.target/aarch64/pr115258.c: New test.
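
For context on the quoted commit: a two-vector TBL reads its lookup table from
two consecutive V registers, and aarch64_combinev16qi is the pattern that puts
the two input vectors into such a pair. A minimal sketch (hypothetical, not
the PR testcase) that reaches this path through the ACLE intrinsic
vqtbl2q_u8:

    #include <arm_neon.h>

    /* The {a, b} table pair must land in consecutive V registers;
       arranging that is the job of aarch64_combinev16qi.  */
    uint8x16_t
    two_vec_tbl (uint8x16_t a, uint8x16_t b, uint8x16_t idx)
    {
      uint8x16x2_t pair = { { a, b } };
      return vqtbl2q_u8 (pair, idx);
    }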

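The "three eors" fallback described above is the classic XOR swap, applied
when the RA allocates the destination pair as (second input, first input). A
scalar illustration of the same idea (the split emits this pattern with
vector EORs):

    /* Swap *x and *y without a scratch register; each statement
       corresponds to one EOR instruction.  */
    static void
    xor_swap (unsigned int *x, unsigned int *y)
    {
      *x ^= *y;   /* x' = x ^ y              */
      *y ^= *x;   /* y' = y ^ (x ^ y) = x    */
      *x ^= *y;   /* x'' = (x ^ y) ^ x = y   */
    }
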
This is causing performance regressions in some TSVC kernels and other code.
Here is an example:
https://godbolt.org/z/r91nYEEsP

We now get the following, where each two-vector TBL is fed by two redundant
movs that copy the same input (v31) into a consecutive register pair:
.L3:
        add     x3, x26, x0
        add     x2, x25, x0
        add     x3, x3, 65536
        add     x2, x2, 65536
        sub     x0, x0, #16
        ldr     q31, [x3, 62448]
        mov     v28.16b, v31.16b
        mov     v29.16b, v31.16b
        tbl     v31.16b, {v28.16b - v29.16b}, v30.16b
        fadd    v31.4s, v31.4s, v25.4s
        mov     v26.16b, v31.16b
        mov     v27.16b, v31.16b
        tbl     v31.16b, {v26.16b - v27.16b}, v30.16b
        str     q31, [x2, 62448]
        cmp     x0, x27
        bne     .L3
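
In the loop above, both halves of each TBL table come from the same vector
(v31), so the mov pairs only duplicate v31 into consecutive registers. A
reduced sketch (hypothetical, not the godbolt source) with the same shape:

    #include <arm_neon.h>

    /* Both table halves are the same vector, so building the register
       pair degenerates into two moves -- the pattern visible above.  */
    uint8x16_t
    same_input_tbl (uint8x16_t v, uint8x16_t idx)
    {
      uint8x16x2_t pair = { { v, v } };
      return vqtbl2q_u8 (pair, idx);
    }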
