On Mon, 21 Mar 2005 13:45:19 +0100 (CET), Richard Guenther
<[EMAIL PROTECTED]> wrote:
> Hi!
>
> On mainline we now use loop versioning and peeling for alignment
> for the following loop (-march=pentium4):
>
> void foo3(float * __restrict__ a, float * __restrict__ b,
>           float * __restrict__ c)
> {
>   int i;
>   for (i=0; i<4; ++i)
>     a[i] = b[i] + c[i];
> }
>
> which results only in slower and larger code.  I also cannot
> see why we zero the xmm registers before loading and why we
> load them in separate high and low halves:
>
> .L13:
>         xorps   %xmm1, %xmm1
>         movlps  (%edx,%esi), %xmm1
>         movhps  8(%edx,%esi), %xmm1
>         xorps   %xmm0, %xmm0
>         movlps  (%edx,%ebx), %xmm0
>         movhps  8(%edx,%ebx), %xmm0
>         addps   %xmm0, %xmm1
>         movaps  %xmm1, (%edx,%eax)
>         addl    $1, %ecx
>         addl    $16, %edx
>         cmpl    %ecx, -16(%ebp)
>         ja      .L13
>
> but the point is, there is nothing to gain from vectorizing the
> loop in the first place if we do not know the alignment beforehand.
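For comparison, a minimal sketch (not from the original mail; the global
arrays and the name foo3_aligned are hypothetical) of the same loop with
statically known 16-byte alignment, where one would expect the vectorizer
to need neither versioning nor peeling and to use aligned accesses directly:

/* Hypothetical variant, not part of the original test case: global
   arrays with a 16-byte alignment attribute, so alignment is provable
   at compile time and no runtime versioning or peeling should be
   generated for the vectorized loop. */
float a4[4] __attribute__((aligned(16)));
float b4[4] __attribute__((aligned(16)));
float c4[4] __attribute__((aligned(16)));

void foo3_aligned(void)
{
  int i;
  for (i = 0; i < 4; ++i)
    a4[i] = b4[i] + c4[i];
}

With -march=pentium4 one would hope for a plain movaps load, an addps
with an aligned memory operand, and a movaps store here, with no split
movlps/movhps loads and no prior xorps of the registers.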
Uh, and with -funroll-loops we seem to be lost completely, as we produce
peeling and versioning code for an eight-times-unrolled copy of a loop
that rolls only four times!  Where has the information about the loop
counter gone?  It looks like vectorization interacts badly with the rest
of the loop optimizers.  Ugh.

Richard.
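For comparison, a hand-unrolled sketch (hypothetical, not from the mail
and not compiler output) of what one would hope for when the trip count
of four is known at compile time: the loop fully unrolled, with no
residual counter, peeling, or versioning left over.

/* Hand-unrolled sketch: with the trip count fixed at 4, no loop or
   counter should survive; this is the scalar equivalent of a single
   4-wide vector add (one load/add/store sequence). */
void foo3_unrolled(float * __restrict__ a, float * __restrict__ b,
                   float * __restrict__ c)
{
  a[0] = b[0] + c[0];
  a[1] = b[1] + c[1];
  a[2] = b[2] + c[2];
  a[3] = b[3] + c[3];
}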