https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-05-23
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
        .cfi_startproc
        movslq  %edi, %rax
        addl    $4, %edi
        vbroadcastsd    %xmm0, %ymm1
        movslq  %edi, %rdi
        vmovddup        %xmm0, %xmm0
        vmulpd  a(,%rax,8), %ymm1, %ymm1
        vmulpd  a(,%rdi,8), %xmm0, %xmm0
        vmovupd %ymm1, r(%rip)
        vmovups %xmm0, r+32(%rip)
        vmovupd r+16(%rip), %ymm1
        ^^^ this one
        vextractf128    $0x1, %ymm1, %xmm2
        vunpckhpd       %xmm2, %xmm2, %xmm0
        vaddsd  .LC0(%rip), %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  r+8(%rip), %xmm0, %xmm0
        vaddsd  r(%rip), %xmm0, %xmm0
        vzeroupper
        ret

Unaligned accesses are prone to STLF (store-to-load forwarding) issues,
but there's no easy way out here; at least I don't see a good way of,
say, restricting the second loop's vectorization to SSE.  Note that when
misaligning by a single element we would have to disable vectorization
completely.

In some way this is a target issue, since the target allows unaligned
loads.  If it split them (we have a tunable for this), we would be fine
here (by luck, until misaligning by something other than an SSE vector
size).

Similar cases can be constructed by placing unvectorized by-element
initializations before a vectorized loop (possibly in another function).

These STLF issues are just a bad "feature" of modern CPUs and the fix is
ultimately in them...
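
For reference, a reduced C sketch of the kind of source that produces the
assembly above (reconstructed from the asm: the arrays a and r and the
broadcast multiply are visible there, but the exact loop bounds, the
reduction's initial value and the compile flags are assumptions, not
necessarily the PR's actual testcase):

    /* Sketch only -- assumes AVX2 vectorization, e.g. -O3 -mavx2.  */
    #define N 1024

    double a[N];
    double r[8];

    double
    test (double x, int i)
    {
      /* Vectorized as a 32-byte store to r[0..3] (vmovupd %ymm1, r(%rip))
         plus a 16-byte store to r[4..5] (vmovups %xmm0, r+32(%rip)).  */
      for (int n = 0; n < 6; ++n)
        r[n] = a[n + i] * x;

      /* The reduction is vectorized with a 32-byte load from r+16,
         i.e. r[2..5], which straddles both preceding stores, so neither
         store can forward its data and the load stalls until the store
         buffer drains.  */
      double sum = 0.0;
      for (int n = 0; n < 6; ++n)
        sum += r[n];
      return sum;
    }

The problematic load overlaps two earlier stores without matching either
of them exactly, which is precisely the case hardware store-to-load
forwarding cannot handle.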