On Mon, 21 Mar 2005 13:45:19 +0100 (CET), Richard Guenther
<[EMAIL PROTECTED]> wrote:
> Hi!
> 
> On mainline we now use loop versioning and peeling for alignment
> for the following loop (-march=pentium4):
> 
> void foo3(float * __restrict__ a, float * __restrict__ b,
>           float * __restrict__ c)
> {
>         int i;
>         for (i=0; i<4; ++i)
>                 a[i] = b[i] + c[i];
> }
> 
> which results only in slower and larger code.  I also cannot
> see why we zero the xmm registers before loading them, or why
> we load them as separate low/high halves:
> 
> .L13:
>         xorps   %xmm1, %xmm1
>         movlps  (%edx,%esi), %xmm1
>         movhps  8(%edx,%esi), %xmm1
>         xorps   %xmm0, %xmm0
>         movlps  (%edx,%ebx), %xmm0
>         movhps  8(%edx,%ebx), %xmm0
>         addps   %xmm0, %xmm1
>         movaps  %xmm1, (%edx,%eax)
>         addl    $1, %ecx
>         addl    $16, %edx
>         cmpl    %ecx, -16(%ebp)
>         ja      .L13
> 
> but the point is, there is nothing to be gained by vectorizing the
> loop in the first place if we do not know the alignment beforehand.
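
To make the quoted transformation concrete, here is a rough source-level
sketch of what versioning plus peeling for alignment amounts to for this
loop.  The names and the runtime checks are invented for illustration,
not what the vectorizer literally emits; the point is just how much
control flow ends up wrapped around a loop that runs exactly four times:

void foo3_sketch(float * __restrict__ a, float * __restrict__ b,
                 float * __restrict__ c)
{
        int i = 0;

        /* Versioning: a runtime check picks the vector or the scalar
           version of the loop (the actual condition is invented here).  */
        if ((((unsigned long) b | (unsigned long) c) & 15) == 0)
        {
                /* Peeling: scalar iterations until the store pointer
                   is 16-byte aligned.  */
                for (; ((unsigned long) (a + i) & 15) != 0 && i < 4; ++i)
                        a[i] = b[i] + c[i];

                /* Vector body: this is where the single addps would
                   go; written out as scalar code here.  */
                for (; i + 4 <= 4; i += 4) {
                        a[i]     = b[i]     + c[i];
                        a[i + 1] = b[i + 1] + c[i + 1];
                        a[i + 2] = b[i + 2] + c[i + 2];
                        a[i + 3] = b[i + 3] + c[i + 3];
                }

                /* Scalar epilogue for whatever is left over.  */
                for (; i < 4; ++i)
                        a[i] = b[i] + c[i];
        }
        else
        {
                /* Plain scalar fallback copy of the whole loop.  */
                for (; i < 4; ++i)
                        a[i] = b[i] + c[i];
        }
}

All of that is overhead the original straight-line four-iteration loop
never needed.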

Uh, and with -funroll-loops we seem to be lost completely: we produce
peeling and versioning loops for an eight-times unrolled loop that only
rolls four times!  Where has the information about the loop counter
gone??
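
For comparison, since the trip count is the compile-time constant 4,
what one would hope unrolling to leave behind is simply the loop body
written out four times, with no peel or versioning loops around it
(hand-written equivalent, not compiler output):

void foo3_unrolled(float * __restrict__ a, float * __restrict__ b,
                   float * __restrict__ c)
{
        /* The whole loop, fully unrolled by its known trip count.  */
        a[0] = b[0] + c[0];
        a[1] = b[1] + c[1];
        a[2] = b[2] + c[2];
        a[3] = b[3] + c[3];
}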

It looks like vectorization interacts badly with the rest of the loop
optimizers.
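
For contrast, here is a minimal variant where the alignment is known up
front, using GCC's aligned attribute on arrays defined in the same
translation unit (the names are made up for illustration; whether the
current vectorizer actually exploits this and drops the versioning and
peeling is a separate question):

float a4[4] __attribute__ ((aligned (16)));
float b4[4] __attribute__ ((aligned (16)));
float c4[4] __attribute__ ((aligned (16)));

void foo3_aligned(void)
{
        int i;
        /* All three arrays are visibly 16-byte aligned, so a single
           aligned vector add would suffice.  */
        for (i = 0; i < 4; ++i)
                a4[i] = b4[i] + c4[i];
}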

Ugh.

Richard.
