https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The core loop is

.L8:
        addq    $1, %rdx
        vaddps  (%r8), %ymm1, %ymm1
        addq    $32, %r8
        cmpq    %rdx, %rcx
        ja      .L8

which, compared to LLVM, is not unrolled.  You can use -funroll-loops to force
that, which probably fixes the performance gap compared to LLVM.

For the short loop above I also suspect this is not the optimal IV choice.
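
For reference, a minimal sketch of the kind of float reduction that produces an
accumulation loop like the one above (not the exact testcase from this PR; the
function name and signature are my own):

  /* sum.c - assumed reduction loop.  Compile e.g. with
       gcc -O3 -mavx -ffast-math [-funroll-loops] -c sum.c
     (-ffast-math, or -fassociative-math, is needed so the FP reduction
     can be vectorized; -mavx gives the ymm registers seen above).  */
  float
  sum_floats (const float *a, unsigned long n)
  {
    float sum = 0.0f;
    for (unsigned long i = 0; i < n; ++i)
      sum += a[i];   /* vectorized into the vaddps accumulation */
    return sum;
  }

Adding -funroll-loops should make GCC unroll the vectorized body, similar to
what LLVM does by default.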