https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The core loop is

.L8:
        addq    $1, %rdx
        vaddps  (%r8), %ymm1, %ymm1
        addq    $32, %r8
        cmpq    %rdx, %rcx
        ja      .L8

which, unlike the code LLVM generates, is not unrolled.  You can use
-funroll-loops to force unrolling, which probably closes the performance
gap to LLVM.  For a loop this short I also suspect the induction variable
(IV) choice is not optimal: it maintains both a counter (%rdx) and a
pointer (%r8).
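
For illustration only, here is a minimal C sketch of what unrolling a
reduction like this can look like at the source level.  It is not the
test case from the bug report; the function names sum_simple and
sum_unrolled4 are hypothetical, and I am assuming a plain float sum.  The
hand-unrolled variant also splits the sum into four independent
accumulators, which goes beyond what plain unrolling does and changes the
floating-point summation order, so results may differ in the last bits
for general inputs.

```c
#include <stddef.h>

/* Straightforward reduction: a single accumulator, so every add
   depends on the result of the previous one. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4 with four independent accumulators (an
   illustrative assumption, not what GCC or LLVM necessarily emit).
   This reassociates the additions. */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: one counter update and one compare per 4 elements. */
    for (; n - i >= 4; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the 0..3 leftover elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions return the same value when every partial sum is exactly
representable (for example, small integers stored as floats).  Whether
the unrolled form is actually faster depends on the target and should be
measured.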
