http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167
--- Comment #3 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:30:49 UTC ---
> this could be the reason for slowdown.
Hm, not really.
But, by changing the generated assembly around loop entry:
$ diff -u testcase2.s testcase2_.s
--- testcase2.s 2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s 2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
addq %r15, %rdx
addq $1, %rdi
salq $5, %rdi
+ movapd %xmm10, %xmm0
jmp .L143
.p2align 4,,10
.p2align 3
@@ -1687,7 +1688,7 @@
mulpd %xmm2, %xmm6
movapd %xmm3, %xmm2
movapd %xmm10, (%rsi,%rcx)
- mulpd %xmm10, %xmm2
+ mulpd %xmm0, %xmm2
movsd (%rdx), %xmm0
movsd 8(%rdx), %xmm1
subpd %xmm6, %xmm2
The slowdown is magically fixed:
$ gcc -lm testcase2_.s
$ time ./a.out
real 0m4.041s
user 0m4.034s
sys 0m0.003s
versus:
$ gcc -lm testcase2.s
$ time ./a.out
real 0m4.239s
user 0m4.234s
sys 0m0.001s
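For scale, a quick sketch of the arithmetic on the two "real" times above. It
uses only the numbers quoted here; the variable names are made up for
illustration, and each figure comes from a single run, so the result should be
read as indicative rather than as a careful measurement.

```python
# Wall-clock ("real") times copied from the two runs quoted above.
original_s = 4.239  # testcase2.s:  mulpd reads %xmm10
modified_s = 4.041  # testcase2_.s: mulpd reads %xmm0

# Absolute and relative difference between the two runs.
saved_s = original_s - modified_s
relative_saving = saved_s / original_s

print(f"saved {saved_s:.3f} s ({relative_saving:.1%} of the original run time)")
```

That works out to about 0.2 s, i.e. somewhat under 5% of the original run time.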
The important change is the substitution %xmm10 -> %xmm0 as the source operand
of the mulpd instruction. The functionality of the test didn't change, because
the existing "movapd %xmm0, %xmm10" at the top of the loop and the extra
"movapd %xmm10, %xmm0" added before the loop ensure that both registers hold
the same value when the mulpd executes.
This all happens on SnB (Sandy Bridge); the code is generated with -O2 only.
H.J., any ideas?