https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-05-23
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
        .cfi_startproc
        movslq  %edi, %rax
        addl    $4, %edi
        vbroadcastsd    %xmm0, %ymm1
        movslq  %edi, %rdi
        vmovddup        %xmm0, %xmm0
        vmulpd  a(,%rax,8), %ymm1, %ymm1
        vmulpd  a(,%rdi,8), %xmm0, %xmm0
        vmovupd %ymm1, r(%rip)
        vmovups %xmm0, r+32(%rip)
        vmovupd r+16(%rip), %ymm1
        ^^^ this one
        vextractf128    $0x1, %ymm1, %xmm2
        vunpckhpd       %xmm2, %xmm2, %xmm0
        vaddsd  .LC0(%rip), %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  r+8(%rip), %xmm0, %xmm0
        vaddsd  r(%rip), %xmm0, %xmm0
        vzeroupper
        ret

Unaligned accesses are prone to STLF (store-to-load forwarding) issues,
but there's no easy way out here; at least I don't see a good way of,
say, restricting the second loop's vectorization to SSE.  Note that when
misaligning by a single element we would have to disable vectorization
completely.

In some way this is a target issue, since the target allows unaligned
loads.  If it split them (we have a tunable for this), we would be fine
here (by luck, until misaligning by something other than an SSE vector
size).

Similar cases can be constructed by placing unvectorized by-element
initializations before a vectorized loop (possibly in another function).

These STLF issues are just a bad "feature" of modern CPUs and the fix is
ultimately in them...
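
For reference, a reduced C sketch of the kind of source that produces the
assembly above (reconstructed from the asm: the arrays a and r and the
broadcast multiply are visible there, but the exact loop bounds, the
reduction's initial value and the compile flags are assumptions, not
necessarily the PR's actual testcase):

    /* Sketch only -- assumes AVX2 vectorization, e.g. -O3 -mavx2.  */
    #define N 1024

    double a[N];
    double r[8];

    double
    test (double x, int i)
    {
      /* Vectorized as a 32-byte store to r[0..3] (vmovupd %ymm1, r(%rip))
         plus a 16-byte store to r[4..5] (vmovups %xmm0, r+32(%rip)).  */
      for (int n = 0; n < 6; ++n)
        r[n] = a[n + i] * x;

      /* The reduction is vectorized with a 32-byte load from r+16,
         i.e. r[2..5], which straddles both preceding stores, so neither
         store can forward its data and the load stalls until the store
         buffer drains.  */
      double sum = 0.0;
      for (int n = 0; n < 6; ++n)
        sum += r[n];
      return sum;
    }

The problematic load overlaps two earlier stores without matching either
of them exactly, which is precisely the case hardware store-to-load
forwarding cannot handle.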