------- Comment #11 from ubizjak at gmail dot com 2007-06-28 11:39 -------
(In reply to comment #10)
> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])

This means that original + 8 unrolled iterations > 8.

So, the loop has 46 insns, and 9 copies of the loop (9 * 46 = 414 insns) are
more than PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), so the unroll is
rejected.

However, even with the vectorized loop unrolled, we are still 50% slower. It
looks like tight sequences of subsd/subpd and mulsd/mulpd kill performance
with -ftree-vectorize:

        movapd  %xmm6, %xmm0
        movsd   %xmm1, -200(%ebp)
        subsd   %xmm5, %xmm0
        subpd   (%ebx), %xmm3
        mulsd   %xmm0, %xmm0
        mulpd   %xmm3, %xmm3
        haddpd  %xmm3, %xmm3
        movapd  %xmm3, %xmm2
        movsd   w2gauss.1408+8, %xmm3
        addsd   %xmm2, %xmm0
        mulsd   w1gauss.1411-8(,%eax,8), %xmm3
        sqrtsd  %xmm0, %xmm1

It looks like nothing helps here other than -fvect-cost-model. The results for
induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse
-funroll-loops) are:

induct.f90, -ftree-vectorize without -fvect-cost-model:  user 1m34.046s
induct.f90, -ftree-vectorize with -fvect-cost-model:     user 0m45.447s
induct.f90 without -ftree-vectorize:                     user 0m45.215s

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32084