------- Comment #17 from xuepeng dot guo at intel dot com 2009-02-09 09:16 -------
Below is the loop from the test case in its original form (compiled by GCC 4.4):

_Z7bench_1PfS_fj:
.LFB2309:
        shrl    $2, %edx
        shufps  $0, %xmm0, %xmm0
        subl    $1, %edx
        xorl    %eax, %eax
        addq    $1, %rdx
        salq    $4, %rdx
        .p2align 4,,10
        .p2align 3
.L11:
        movaps  %xmm0, %xmm1
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep ret

The time is:

[xg...@shgcc-10 38824]$ g++ 44.s -o orig.out
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.878s
user    0m1.877s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.879s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.873s
user    0m1.872s
sys     0m0.001s

After adding two nops:

.L11:
        movaps  %xmm0, %xmm1
        nop
        nop
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep ret

The time is:

[xg...@shgcc-10 38824]$ g++ 44.s -o 2nop.out
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.761s
sys     0m0.000s

The version with two nops is consistently faster, so I suspect that the code layout may be hurting performance.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
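
For reference, the gap between the two sets of runs can be put into a single number. The snippet below is only a sketch for doing that arithmetic: the helper names mean_seconds and relative_saving are made up for illustration, and the inputs are the "real" times quoted above. With those numbers the two-nop variant comes out roughly 6% faster than the original loop.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of timings, in seconds.
// (Hypothetical helper, not part of the test case.)
double mean_seconds(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Fraction of the baseline run time saved by the variant:
// (mean(baseline) - mean(variant)) / mean(baseline).
double relative_saving(const std::vector<double>& baseline,
                       const std::vector<double>& variant) {
    const double base = mean_seconds(baseline);
    return (base - mean_seconds(variant)) / base;
}

// "real" times copied from the three runs of each binary above.
const std::vector<double> orig_real    = {1.878, 1.879, 1.873};
const std::vector<double> two_nop_real = {1.762, 1.762, 1.762};
```

Calling relative_saving(orig_real, two_nop_real) gives the fraction saved. Three runs per binary is a small sample, so this should be read as an indication of the size of the effect rather than a precise measurement.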