https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #2)
> (In reply to Richard Biener from comment #1)
> > The core loop is
> >
> > .L8:
> > addq $1, %rdx
> > vaddps (%r8), %ymm1, %ymm1
> > addq $32, %r8
> > cmpq %rdx, %rcx
> > ja .L8
> >
> > which compared to LLVM is not unrolled. You can use -funroll-loops to
> > force that which probably fixes the performance compared to LLVM. For
> > the short loop above I also guess this is not the optimal IV choice.
>
> -funroll-loops only gains 10% or so, nowhere near the factor of 2 with
> clang. Except for the slightly better induction choice in llvm, the 2
> unrolled loops look quite similar, I have a hard time seeing how one can be
> so much faster than the other. Maybe the alignment somehow ends up better in
> one case? Or the loop being one instruction shorter lets it fit better in
> cache?
Can you post a full example? The LLVM bug and this copy lack information
on which actual 'a' and 'n' are used. Note that unless 'a' fits in L2, I
highly doubt one can exceed memory bandwidth (and thus code-gen should not
matter unless it affects the HW prefetcher).