https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #2 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> The core loop is
> 
> .L8:
>         addq    $1, %rdx
>         vaddps  (%r8), %ymm1, %ymm1
>         addq    $32, %r8
>         cmpq    %rdx, %rcx
>         ja      .L8
> 
> which compared to LLVM is not unrolled.  You can use -funroll-loops to
> force that which probably fixes the performance compared to LLVM.  For
> the short loop above I also guess this is not the optimal IV choice.

-funroll-loops only gains 10% or so, nowhere near the factor of 2 with clang.
Except for the slightly better induction variable choice in LLVM, the two
unrolled loops look quite similar; I have a hard time seeing how one can be so
much faster than the other. Maybe the alignment somehow ends up better in one
case? Or the loop being one instruction shorter lets it fit better in cache?
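
For anyone who wants to experiment, here is a minimal sketch of the kind of
loop that could produce a core loop like the one quoted. The actual test case
is not shown in this comment, so the float-sum shape is an assumption inferred
from the vaddps accumulation into %ymm1, and the names sum_simple and
sum_unrolled4 are hypothetical. The second function is unrolled by hand with
separate accumulators, which may help check whether the unrolled shape or
something else (alignment, code size) accounts for the gap.

```c
#include <stddef.h>

/* Assumed shape of the reduction: a single accumulator, so every add
 * depends on the result of the previous one. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4 with independent accumulators (illustrative only,
 * not what either compiler necessarily emits). */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent partial sums per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the remaining 0..3 elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        s += a[i];
    return s;
}
```

Note that the two functions add the elements in a different order, so their
results can differ in the last bits for general input; GCC only vectorizes a
float reduction like this when reassociation is allowed (e.g. -ffast-math).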
