http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #3 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:30:49 UTC ---
> this could be the reason for slowdown.

Hm, not really. But, by changing the generated assembly around loop entry:

$ diff -u testcase2.s testcase2_.s
--- testcase2.s 2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s 2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
     addq    %r15, %rdx
     addq    $1, %rdi
     salq    $5, %rdi
+    movapd  %xmm10, %xmm0
     jmp     .L143
     .p2align 4,,10
     .p2align 3
@@ -1687,7 +1688,7 @@
     mulpd   %xmm2, %xmm6
     movapd  %xmm3, %xmm2
     movapd  %xmm10, (%rsi,%rcx)
-    mulpd   %xmm10, %xmm2
+    mulpd   %xmm0, %xmm2
     movsd   (%rdx), %xmm0
     movsd   8(%rdx), %xmm1
     subpd   %xmm6, %xmm2

The slowdown is magically fixed:

$ gcc -lm testcase2_.s
$ time ./a.out

real    0m4.041s
user    0m4.034s
sys     0m0.003s

versus:

$ gcc -lm testcase2.s
$ time ./a.out

real    0m4.239s
user    0m4.234s
sys     0m0.001s

The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
The functionality of the test didn't change due to existing
"movapd %xmm0, %xmm10" at the top of the loop and added extra
"movapd %xmm10, %xmm0" before the loop.

This all happens on SnB, the code is generated with -O2 only.

H.J., any ideas?
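For reference, the size of the difference between the two runs can be worked out from the timings quoted above. The following is only a small sketch: the names `ORIGINAL`, `MODIFIED` and `relative_gain` are made up for illustration, and each figure comes from a single `time` run, so the result is a rough indication, not a measurement with error bars.

```python
# Timings copied from the two `time ./a.out` runs quoted above (seconds).
# The dictionary and function names are illustrative, not from the report.
ORIGINAL = {"real": 4.239, "user": 4.234}  # built from testcase2.s
MODIFIED = {"real": 4.041, "user": 4.034}  # built from testcase2_.s


def relative_gain(slow: float, fast: float) -> float:
    """Return the fraction of the slower run's time saved by the faster run."""
    return (slow - fast) / slow


for key in ("real", "user"):
    saved = ORIGINAL[key] - MODIFIED[key]
    gain = relative_gain(ORIGINAL[key], MODIFIED[key])
    print(f"{key}: {saved:.3f}s saved, {gain:.1%} less time")
```

Both the real and the user time come out at a little under 5% less for the modified assembly, which is the slowdown under discussion.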