https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #9 from Allan Jensen <linux at carewolf dot com> ---
Looking at the assembler, it does indeed appear that the only difference is
loop unrolling and if conversion.

After testing on another machine (an old Phenom II as opposed to the
Sandy Bridge), I can report that disabling tree-loop-if-convert, either
directly or indirectly via tree-loop-vectorize, lets -O3 regain all of the
speed difference to -O2 on the Phenom II.
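
For anyone following along who has not looked at if conversion before, here
is a minimal sketch of the kind of loop it targets. This is a hypothetical
kernel, not the test case from this bug; the function names and the clamping
operation are made up for illustration. The first loop has a branch in its
body, the second is roughly what the loop looks like once the branch has been
turned into a select.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not taken from the bug report: clamp each
 * byte to an upper limit. The loop body contains a branch. */
void clamp_branchy(uint8_t *dst, const uint8_t *src, size_t n, uint8_t limit)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] > limit)
            dst[i] = limit;
        else
            dst[i] = src[i];
    }
}

/* The same loop written by hand in if-converted form: the control
 * dependency becomes a data dependency (a select), leaving a
 * straight-line body that is easier to vectorize. */
void clamp_if_converted(uint8_t *dst, const uint8_t *src, size_t n, uint8_t limit)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t v = src[i];
        dst[i] = (v > limit) ? limit : v;
    }
}
```

Compiling such a file with "gcc -O3 -S" and again with
"gcc -O3 -fno-tree-loop-if-convert -S" and diffing the assembler is one way
to see what the pass changes. Whether it makes any difference for this
particular sketch depends on the GCC version and target.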

My guess is that the small loop unrolling conflicts with the uop cache that
Intel introduced with Sandy Bridge and kept in newer architectures, which
speeds up small tight loops. On architectures without a uop cache the loop
unrolling is probably still slightly faster.

Unfortunately, using -mtune=sandybridge does not improve the situation, so
maybe there should be some architecture tuning even for trivial loop
unrolling, and possibly a discussion of making it part of generic-x64 tuning.
