Hi!

On mainline we now use loop versioning and peeling for alignment
for the following loop (-march=pentium4):

void foo3(float * __restrict__ a, float * __restrict__ b,
          float * __restrict__ c)
{
        int i;
        for (i=0; i<4; ++i)
                a[i] = b[i] + c[i];
}
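
For reference, the versioning amounts to guarding a vector copy of
the loop with a runtime alignment check, roughly like the following
hand-written sketch (the function name and the exact form of the
check are made up; the vectorizer's real output is more involved,
and peeling would instead run scalar iterations until a[] reaches a
16-byte boundary):

typedef float v4sf __attribute__ ((vector_size (16)));

void foo3_versioned (float * __restrict__ a, float * __restrict__ b,
                     float * __restrict__ c)
{
        int i;
        /* Runtime check: all three pointers 16-byte aligned?  */
        if ((((unsigned long) a | (unsigned long) b
              | (unsigned long) c) & 15) == 0)
                /* Vector version: one 16-byte add.  */
                *(v4sf *) a = *(v4sf *) b + *(v4sf *) c;
        else
                /* Fallback: the original scalar loop.  */
                for (i = 0; i < 4; ++i)
                        a[i] = b[i] + c[i];
}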

The versioning and peeling result only in slower and larger code
here.  I also cannot see why we zero the xmm registers before
loading them, or why we load them in separate high/low halves:

.L13:
        xorps   %xmm1, %xmm1
        movlps  (%edx,%esi), %xmm1
        movhps  8(%edx,%esi), %xmm1
        xorps   %xmm0, %xmm0
        movlps  (%edx,%ebx), %xmm0
        movhps  8(%edx,%ebx), %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, (%edx,%eax)
        addl    $1, %ecx
        addl    $16, %edx
        cmpl    %ecx, -16(%ebp)
        ja      .L13
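
For comparison, if the alignment were known, the whole body would
reduce to plain movaps loads, as in this hand-written intrinsics
equivalent (foo3_aligned is my name for it; it assumes the caller
guarantees 16-byte alignment of all three pointers):

#include <xmmintrin.h>

void foo3_aligned (float * __restrict__ a, float * __restrict__ b,
                   float * __restrict__ c)
{
        __m128 vb = _mm_load_ps (b);            /* movaps load */
        __m128 vc = _mm_load_ps (c);            /* movaps load */
        _mm_store_ps (a, _mm_add_ps (vb, vc));  /* addps + movaps store */
}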


But the point is, there is nothing to gain from vectorizing the loop
in the first place if we do not know the alignment beforehand.
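
One way to make the alignment known at compile time is to encode it
in the types; something along these lines (the aligned typedef is
just an illustrative sketch) would let the vectorizer skip the
versioning and peeling entirely:

typedef float afloat __attribute__ ((aligned (16)));

void foo3_a (afloat * __restrict__ a, afloat * __restrict__ b,
             afloat * __restrict__ c)
{
        int i;
        /* Same loop, but the pointed-to type now carries
           a 16-byte alignment guarantee.  */
        for (i = 0; i < 4; ++i)
                a[i] = b[i] + c[i];
}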

Richard.
