[Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 12 Jan 2023 05:42:38 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC now does

.L2:
        vmovaps b(%rax), %xmm6
        vmulps  a(%rax), %xmm6, %xmm0
        addq    $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps  %xmm0, %xmm5, %xmm5
        vmulps  a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps  %xmm0, %xmm1, %xmm1
        vmulps  a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps  %xmm0, %xmm4, %xmm4
        vmulps  a-32(%rax), %xmm7, %xmm0
        vaddps  %xmm0, %xmm2, %xmm2
        vmulps  a-16(%rax), %xmm6, %xmm0
        vaddps  %xmm0, %xmm3, %xmm3
        cmpq    $128000, %rax
        jne     .L2

and thus uses a VF of 4 with -Ofast: five V4SF accumulators, 20 lanes
in total.  Your LLVM snippet instead uses 5 lanes, four in a V4SF and
one in a scalar.  That's interesting but not something we support.
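
To make the lane accounting concrete, here is a scalar C sketch of the
shape of the loop above: five accumulators of four lanes each, 20 floats
(80 bytes) consumed per iteration, and a final reduction.  The function
name and the LANES/ACCS constants are made up for illustration, and this
is not what the vectorizer literally emits.  Keeping 20 partial sums
reassociates the floating-point additions, which is why this form needs
-Ofast (or -fassociative-math).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, chosen only for this sketch. */
enum { LANES = 4, ACCS = 5 };

/* Dot product kept as ACCS x LANES = 20 independent partial sums,
   mirroring the five V4SF accumulators in the assembly above. */
float dot_20_partial_sums(const float *a, const float *b, size_t n)
{
    assert(n % (LANES * ACCS) == 0);   /* no epilogue in this sketch */

    float acc[ACCS][LANES] = {{0.0f}};

    /* One trip handles 20 consecutive floats, i.e. "addq $80, %rax". */
    for (size_t i = 0; i < n; i += LANES * ACCS)
        for (size_t v = 0; v < ACCS; ++v)        /* one vmulps/vaddps pair */
            for (size_t l = 0; l < LANES; ++l) { /* lanes of one V4SF */
                size_t k = i + v * LANES + l;
                acc[v][l] += a[k] * b[k];
            }

    /* Final reduction of the 20 partial sums, done once after the loop. */
    float dot = 0.0f;
    for (size_t v = 0; v < ACCS; ++v)
        for (size_t l = 0; l < LANES; ++l)
            dot += acc[v][l];
    return dot;
}
```

With n = 32000 floats the loop bound is 128000 bytes, matching the
"cmpq $128000, %rax" above.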

Re-rolling would mean using a single four-lane V4SF vector here.  For
a pure SLP loop something like this should be possible without too
much hassle, I think.  We'd just need to try ... (and think about
whether it's worth it in real life).
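
As a source-level illustration of what re-rolling means here, the sketch
below puts a dot product manually unrolled by 5 next to its re-rolled
form.  The unrolled shape is assumed from the TSVC s352 kernel and is
not quoted in this bug; the function names are made up for the example.
The re-rolled loop has a single multiply-add per iteration, so a plain
four-lane V4SF loop vectorization would apply to it directly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of the s352 kernel: dot product unrolled by 5. */
float dot_unrolled5(const float *a, const float *b, size_t n)
{
    assert(n % 5 == 0);   /* the unrolled form has no remainder loop */
    float dot = 0.0f;
    for (size_t i = 0; i < n; i += 5)
        dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3]
                  + a[i + 4] * b[i + 4];
    return dot;
}

/* Re-rolled form: the same additions in the same order, one
   multiply-add per iteration. */
float dot_rerolled(const float *a, const float *b, size_t n)
{
    float dot = 0.0f;
    for (size_t i = 0; i < n; ++i)
        dot += a[i] * b[i];
    return dot;
}
```

Both functions accumulate the products left to right, so in source
order they compute the same sum; only the loop structure differs.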

For the purpose of the Summary this is fixed now.
