[Bug other/71414] 2x slower than clang summing small float array

rguenth at gcc dot gnu.org Mon, 06 Jun 2016 02:38:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #2)
> (In reply to Richard Biener from comment #1)
> > The core loop is
> > 
> > .L8:
> >         addq    $1, %rdx
> >         vaddps  (%r8), %ymm1, %ymm1
> >         addq    $32, %r8
> >         cmpq    %rdx, %rcx
> >         ja      .L8
> > 
> > which compared to LLVM is not unrolled.  You can use -funroll-loops to
> > force that which probably fixes the performance compared to LLVM.  For
> > the short loop above I also guess this is not the optimal IV choice.
> 
> -funroll-loops only gains 10% or so, nowhere near the factor of 2 with
> clang. Except for the slightly better induction choice in llvm, the 2
> unrolled loops look quite similar, I have a hard time seeing how one can be
> so much faster than the other. Maybe the alignment somehow ends up better in
> one case? Or the loop being one instruction shorter lets it fit better in
> cache?

Can you post a full example?  The LLVM bug and this copy lack information
on which actual 'a' and 'n' are used.  Note that unless 'a' fits in L2, I
highly doubt one can exceed memory bandwidth (and thus code-gen should not
matter unless it affects the HW prefetcher).
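
For reference, a full example of the kind requested could look roughly like
the sketch below.  It is not the reporter's test case: the names sum_floats
and seconds_per_sum, the fill value, and the choice of clock() for timing are
all assumptions made for illustration.  The harness takes 'n' as a parameter
so the working set can be sized to fit in L1 or L2, or to spill to memory.
The empty asm statement is a GNU extension (accepted by both GCC and clang)
used here only to keep the compiler from hoisting the call out of the
repetition loop.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* The reduction under discussion: sum n floats starting at a. */
float sum_floats(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical harness (not from the bug report): fills an array of n
   floats with 1.0f, sums it reps times, stores the last sum in *result
   and returns the average CPU seconds per call, or -1.0 on error. */
double seconds_per_sum(size_t n, unsigned reps, float *result)
{
    if (n == 0 || reps == 0)
        return -1.0;

    float *a = malloc(n * sizeof *a);
    if (a == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0f;

    float last = 0.0f;
    clock_t t0 = clock();
    for (unsigned r = 0; r < reps; r++) {
        /* Compiler barrier (GNU extension): tells the compiler the
           array may have changed, so each iteration really re-sums. */
        __asm__ volatile("" : : "r"(a) : "memory");
        last = sum_floats(a, n);
    }
    clock_t t1 = clock();

    free(a);
    *result = last;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_sum from a small main() with a few values of n (for
example one that fits in L1, one that fits in L2, and one much larger) would
show whether the gap to clang depends on the working-set size.  The compiler
flags used in the original report are not stated in this message; note that
vectorizing a float reduction like this generally requires allowing
reassociation (e.g. -ffast-math), plus a suitable -m option to get the ymm
code shown above.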
