Hi! On mainline we now use loop versioning and peeling for alignment for the following loop (-march=pentium4):
void foo3(float * __restrict__ a, float * __restrict__ b,
          float * __restrict__ c)
{
  int i;
  for (i = 0; i < 4; ++i)
    a[i] = b[i] + c[i];
}

which results only in slower and larger code. I also cannot see why we
zero the xmm registers before loading, or why we load them in separate
high/low halves:

.L13:
        xorps   %xmm1, %xmm1
        movlps  (%edx,%esi), %xmm1
        movhps  8(%edx,%esi), %xmm1
        xorps   %xmm0, %xmm0
        movlps  (%edx,%ebx), %xmm0
        movhps  8(%edx,%ebx), %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, (%edx,%eax)
        addl    $1, %ecx
        addl    $16, %edx
        cmpl    %ecx, -16(%ebp)
        ja      .L13

But the main point is that there is nothing to be gained by vectorizing
this loop in the first place if we do not know the alignment beforehand.

Richard.