https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492
--- Comment #9 from Allan Jensen <linux at carewolf dot com> ---
Looking at the assembler, it does indeed appear that the only difference is loop unrolling and if-conversion. I have also tested on another machine (an old Phenom II, as opposed to the Sandy Bridge), and can report that disabling tree-loop-if-convert at -O3, either directly or indirectly via tree-loop-vectorize, regains all of the speed difference to -O2 on the Phenom II. My guess is that the small loop unrolling conflicts with the op-cache that Intel introduced in Sandy Bridge and newer architectures, which speeds up small tight loops. On architectures without an op-cache, the loop unrolling is probably still slightly faster. Unfortunately, using -mtune=sandybridge does not improve the situation, so maybe there should be some architecture-specific tuning even for trivial loop unrolling, and possibly a discussion on making it part of generic-x64 tuning.
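For readers unfamiliar with the term, here is a minimal sketch of what if-conversion does to a loop body. This is not the code from this bug report; the function names (`sum_positive_branchy`, `sum_positive_selected`, `check_same`) are hypothetical, and the second function is a hand-written approximation of the transformed form, not actual compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: the loop body contains a conditional jump around the add. */
int sum_positive_branchy(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}

/* Hand-written approximation of the if-converted form (an assumption, not
 * GCC output): the branch becomes a select, so the body is straight-line
 * code that the vectorizer and unroller can then work on. */
int sum_positive_selected(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int add = (a[i] > 0) ? a[i] : 0;
        sum += add;
    }
    return sum;
}

/* Sanity helper: both forms must compute the same result. */
void check_same(const int *a, size_t n)
{
    assert(sum_positive_branchy(a, n) == sum_positive_selected(a, n));
}
```

Building a loop like the first one with -O3 and again with -O3 -fno-tree-loop-if-convert, then comparing the generated assembler, is one way to see whether the transformation was applied. Whether it helps or hurts will depend on the target, as described above.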