------- Comment #11 from ubizjak at gmail dot com 2007-06-28 11:39 -------
(In reply to comment #10)
> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])
This means that the original plus 8 unrolled iterations exceeds 8. So, the loop
has 46 insns, and 9 copies of the loop (9 * 46 = 414 insns) is more than
PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), so the unroll is rejected.
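The size arithmetic described above can be sketched as follows. This only
illustrates the check as explained in this comment, it is not GCC's actual
code. The function name would_completely_peel is hypothetical, and the figures
(46 insns, 8 peeled iterations, limit of 400) are the ones quoted here.

```python
def would_completely_peel(loop_insns, peeled_iterations, max_insns):
    """Hypothetical helper: size check for complete peeling as described above."""
    # The original loop body plus one copy per peeled iteration.
    copies = peeled_iterations + 1
    unrolled_size = loop_insns * copies
    # Peeling is accepted only if the total size stays within the limit.
    return unrolled_size <= max_insns, unrolled_size


accepted, size = would_completely_peel(46, 8, 400)
print(accepted, size)  # False 414
```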
However, even with the unrolled vectorized loop, we are still 50% slower. It
looks like tight sequences of subsd/subpd and mulsd/mulpd kill performance
with -ftree-vectorize:
movapd %xmm6, %xmm0
movsd %xmm1, -200(%ebp)
subsd %xmm5, %xmm0
subpd (%ebx), %xmm3
mulsd %xmm0, %xmm0
mulpd %xmm3, %xmm3
haddpd %xmm3, %xmm3
movapd %xmm3, %xmm2
movsd w2gauss.1408+8, %xmm3
addsd %xmm2, %xmm0
mulsd w1gauss.1411-8(,%eax,8), %xmm3
sqrtsd %xmm0, %xmm1
It looks like nothing helps here except -fvect-cost-model. The results for
induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse
-funroll-loops) are:
induct.f90, -ftree-vectorize without -fvect-cost-model:
user 1m34.046s
induct.f90, -ftree-vectorize with -fvect-cost-model:
user 0m45.447s
induct.f90 without -ftree-vectorize:
user 0m45.215s
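
To compare the three runs, the user times can be converted to seconds and
expressed relative to the non-vectorized build. This is a small sketch using
only the timings listed above. The helper name to_seconds and the dictionary
labels are made up for illustration.

```python
def to_seconds(stamp):
    """Convert a `time`-style string such as "1m34.046s" to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# User times copied from the results above; the labels are illustrative.
timings = {
    "vectorize, no cost model": "1m34.046s",
    "vectorize, cost model": "0m45.447s",
    "no vectorize": "0m45.215s",
}

baseline = to_seconds(timings["no vectorize"])
for name, stamp in timings.items():
    ratio = to_seconds(stamp) / baseline
    print(f"{name}: {ratio:.2f}x of baseline")
```

Without the cost model the vectorized build takes roughly twice as long as the
baseline, while with -fvect-cost-model it is within about 1% of it.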
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32084