https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2019-05-23 00:00:00         |2025-2-12
                 CC|                            |konstantinos.eleftheriou@vrull.eu,
                   |                            |law at gcc dot gnu.org

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
Assembly with -O3 -march=skylake is still

loop:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rdi
        vbroadcastsd    %xmm0, %ymm1
        vmovddup        %xmm0, %xmm0
        vmulpd  a(,%rdi,8), %ymm1, %ymm1
        vxorpd  %xmm4, %xmm4, %xmm4
        vmovupd %ymm1, r(%rip)            <--- Offsetted full store
        vmulpd  a+32(,%rdi,8), %xmm0, %xmm0
        vmovupd %xmm0, r+32(%rip)         <--- Store upper half
        vmovupd r+16(%rip), %ymm2         <--- STLF fail
        vextractf128    $0x1, %ymm2, %xmm3
        vunpckhpd       %xmm3, %xmm3, %xmm0
        vaddsd  %xmm4, %xmm0, %xmm0
        vunpckhpd       %xmm2, %xmm2, %xmm4
        vaddsd  %xmm3, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm3
        vaddsd  %xmm4, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm1, %xmm0
        vzeroupper
        ret

when you enable -favoid-store-forwarding this is split as

        vmulpd  a(,%rdi,8), %ymm1, %ymm1
        vmovupd %ymm1, r(%rip)
        vmulpd  a+32(,%rdi,8), %xmm0, %xmm0
        vmovupd r+16(%rip), %ymm5
        vmovapd %ymm5, -32(%rsp)
        vmovapd %xmm0, -16(%rsp)
        vmovapd -32(%rsp), %ymm6
        vmovupd %xmm0, r+32(%rip)

but that is even worse now, the offset load is still there and the two
stack moves don't forward.  So we replaced one with two STLF fails.  Ugh.
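
For readers without the testcase at hand, here is a minimal C sketch of the kind of source that would produce the assembly above. It is reconstructed from the instruction sequence (six products stored to r, then summed in reverse order), not quoted from the report. The array definitions and initial values are placeholders, and defining the arrays locally rather than declaring them extern may change what the compiler emits. The helper load_within_store is a hypothetical name for a deliberately simplified model: it assumes a load can be forwarded only when all of its bytes come from one earlier store. Real hardware rules are more detailed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder data: the sizes and values are assumptions for illustration. */
double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
double r[6];

/* Reconstructed from the assembly in the comment, not taken from the report. */
double loop(int k, double x)
{
    double t = 0;
    /* Vectorized as a 32-byte store to r plus a 16-byte store to r+32. */
    for (int i = 0; i < 6; i++)
        r[i] = x * a[i + k];
    /* Reads r back in reverse; the vectorizer uses a 32-byte load from r+16. */
    for (int i = 0; i < 6; i++)
        t += r[5 - i];
    return t;
}

/* Simplified model (assumption): forwarding succeeds only if the load's
   byte range lies entirely inside a single earlier store's byte range.
   Offsets are in bytes from the start of the object. */
bool load_within_store(size_t store_off, size_t store_len,
                       size_t load_off, size_t load_len)
{
    return load_off >= store_off
        && load_off + load_len <= store_off + store_len;
}
```

Under that model, the 32-byte load at r+16 fits in neither the 32-byte store at r+0 nor the 16-byte store at r+32, which matches the "STLF fail" annotation. The same check applied to the stack slots in the -favoid-store-forwarding sequence (a 32-byte store at -32(%rsp), a 16-byte store at -16(%rsp), then a 32-byte load at -32(%rsp)) also fails, because the load needs bytes from both stores.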