Hi! On mainline we now use loop versioning and peeling for alignment for the following loop (-march=pentium4):
void foo3(float * __restrict__ a, float * __restrict__ b,
          float * __restrict__ c)
{
  int i;
  for (i = 0; i < 4; ++i)
    a[i] = b[i] + c[i];
}

which results only in slower and larger code. I also cannot see why we
zero the xmm registers before loading, or why we load them in separate
high/low halves:

.L13:
        xorps   %xmm1, %xmm1
        movlps  (%edx,%esi), %xmm1
        movhps  8(%edx,%esi), %xmm1
        xorps   %xmm0, %xmm0
        movlps  (%edx,%ebx), %xmm0
        movhps  8(%edx,%ebx), %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, (%edx,%eax)
        addl    $1, %ecx
        addl    $16, %edx
        cmpl    %ecx, -16(%ebp)
        ja      .L13

But the main point is that there is nothing to be gained by vectorizing
this loop in the first place if we do not know the alignment beforehand.

Richard.