https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #2)
> (In reply to Richard Biener from comment #1)
> > The core loop is
> >
> > .L8:
> >         addq    $1, %rdx
> >         vaddps  (%r8), %ymm1, %ymm1
> >         addq    $32, %r8
> >         cmpq    %rdx, %rcx
> >         ja      .L8
> >
> > which compared to LLVM is not unrolled.  You can use -funroll-loops to
> > force that which probably fixes the performance compared to LLVM.  For
> > the short loop above I also guess this is not the optimal IV choice.
>
> -funroll-loops only gains 10% or so, nowhere near the factor of 2 with
> clang. Except for the slightly better induction choice in llvm, the 2
> unrolled loops look quite similar, I have a hard time seeing how one can be
> so much faster than the other. Maybe the alignment somehow ends up better in
> one case? Or the loop being one instruction shorter lets it fit better in
> cache?

Can you post a full example?  The LLVM bug and this copy lack information on
what actual 'a' and 'n' are used.

Note that unless 'a' fits in L2 I highly doubt one can exceed memory bandwidth
(and thus code-gen should not matter unless it affects the HW prefetcher).
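
For reference, a minimal sketch of the kind of full example that would help,
assuming the kernel is a plain float sum over 'a' of length 'n' (the vaddps
loop above suggests this, but the report does not confirm it). The names
sum_floats and working_set_bytes are hypothetical and not taken from the
report.

```c
#include <stddef.h>

/* Assumed kernel: a float reduction over a[0..n). GCC would presumably
   need -Ofast or -ffast-math to vectorize it into a vaddps loop, since
   that reorders the floating-point additions. */
float sum_floats(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Bytes touched by one pass over the array. Comparing this against the
   L2 size of the test machine tells whether a run is likely limited by
   memory bandwidth rather than by the generated code. */
size_t working_set_bytes(size_t n)
{
    return n * sizeof(float);
}
```

Timing repeated calls to sum_floats for one 'n' whose working set fits in L2
and one whose working set clearly does not would show whether the gap to
clang only appears in the cache-resident case.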