------- Comment #11 from ubizjak at gmail dot com  2007-06-28 11:39 -------
(In reply to comment #10)

> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])

This means that the original plus 8 unrolled iterations is more than 8. The
loop has 46 insns, so 9 copies of it (9 * 46 = 414 insns) exceed
PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), and the unroll is rejected.
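
As a rough illustration of the arithmetic above (not GCC's actual code), the
size check could be sketched like this. The function names are hypothetical;
only the 46-insn loop size, the 8 iterations, and the 400-insn limit come from
this report.

```python
# Hypothetical model of the size check described above; not taken from GCC.
MAX_COMPLETELY_PEELED_INSNS = 400  # limit quoted in this comment


def peeled_size(loop_insns, iterations):
    """Insns after complete peeling: the original body plus one copy per iteration."""
    return loop_insns * (iterations + 1)


def can_peel_completely(loop_insns, iterations, limit=MAX_COMPLETELY_PEELED_INSNS):
    """True if the fully peeled loop stays within the insn limit."""
    return peeled_size(loop_insns, iterations) <= limit


# 46 insns, original + 8 iterations = 9 copies
print(peeled_size(46, 8), can_peel_completely(46, 8))  # 414 False
```

Under this model, a loop body of 44 insns or fewer would still fit in 9 copies
(44 * 9 = 396).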

However, even with the vectorized loop unrolled, we are still 50% slower. It
looks like tight sequences of subsd/subpd and mulsd/mulpd kill performance
with -ftree-vectorize:

        movapd  %xmm6, %xmm0
        movsd   %xmm1, -200(%ebp)
        subsd   %xmm5, %xmm0
        subpd   (%ebx), %xmm3
        mulsd   %xmm0, %xmm0
        mulpd   %xmm3, %xmm3
        haddpd  %xmm3, %xmm3
        movapd  %xmm3, %xmm2
        movsd   w2gauss.1408+8, %xmm3
        addsd   %xmm2, %xmm0
        mulsd   w1gauss.1411-8(,%eax,8), %xmm3
        sqrtsd  %xmm0, %xmm1

It looks like nothing helps here except -fvect-cost-model. The results for
induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse
-funroll-loops) are:

induct.f90, -ftree-vectorize without -fvect-cost-model:
user    1m34.046s

induct.f90, -ftree-vectorize with -fvect-cost-model:
user    0m45.447s

induct.f90 without -ftree-vectorize:
user    0m45.215s
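
To put the three timings on a common scale, a small sketch like the following
converts them to seconds and expresses each as a multiple of the
non-vectorized run. The labels and the helper name are only illustrative; the
timings are the ones listed above.

```python
def to_seconds(minutes, seconds):
    """Convert a `time`-style XmY.YYYs reading to seconds."""
    return 60 * minutes + seconds


# "user" times reported above for induct.f90
timings = {
    "vectorize, no cost model": to_seconds(1, 34.046),
    "vectorize, cost model": to_seconds(0, 45.447),
    "no vectorize": to_seconds(0, 45.215),
}

baseline = timings["no vectorize"]
ratios = {name: t / baseline for name, t in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Without the cost model, the vectorized build takes roughly twice as long as
the non-vectorized one; with the cost model the two are within about 1%.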


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32084
