http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #3 from Uros Bizjak <ubizjak at gmail dot com> 2011-01-05 19:30:49 UTC ---
> this could be the reason for slowdown.

Hm, not really.

But, by changing the generated assembly around loop entry:

$ diff -u testcase2.s testcase2_.s
--- testcase2.s    2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s    2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
     addq    %r15, %rdx
     addq    $1, %rdi
     salq    $5, %rdi
+    movapd    %xmm10, %xmm0
     jmp    .L143
     .p2align 4,,10
     .p2align 3
@@ -1687,7 +1688,7 @@
     mulpd    %xmm2, %xmm6
     movapd    %xmm3, %xmm2
     movapd    %xmm10, (%rsi,%rcx)
-    mulpd    %xmm10, %xmm2
+    mulpd    %xmm0, %xmm2
     movsd    (%rdx), %xmm0
     movsd    8(%rdx), %xmm1
     subpd    %xmm6, %xmm2

The slowdown is magically fixed:

$ gcc -lm testcase2_.s
$ time ./a.out

real    0m4.041s
user    0m4.034s
sys    0m0.003s

versus:

$ gcc -lm testcase2.s
$ time ./a.out

real    0m4.239s
user    0m4.234s
sys    0m0.001s
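
To put a number on the difference between the two runs, here is a minimal sketch that computes the relative slowdown from the timings quoted above. The dictionary names and the relative_slowdown helper are illustrative only, not part of the testcase. These are single runs, so the result (roughly 5%) is a rough indication, not a careful measurement.

```python
# Timings copied from the two `time ./a.out` runs quoted above (seconds).
ORIGINAL = {"real": 4.239, "user": 4.234}  # built from testcase2.s
PATCHED = {"real": 4.041, "user": 4.034}   # built from testcase2_.s


def relative_slowdown(slow: float, fast: float) -> float:
    """Return how much longer `slow` takes than `fast`, as a fraction of `fast`."""
    return (slow - fast) / fast


if __name__ == "__main__":
    for key in ("real", "user"):
        delta = ORIGINAL[key] - PATCHED[key]
        pct = 100.0 * relative_slowdown(ORIGINAL[key], PATCHED[key])
        print(f"{key}: {delta:.3f}s difference, {pct:.1f}% slower without the change")
```

Repeating each run several times and comparing the minimum or median would show whether the gap is stable.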

The important change is the replacement of %xmm10 with %xmm0 as the source
operand of the mulpd instruction. The behavior of the test is unchanged because
of the existing "movapd    %xmm0, %xmm10" at the top of the loop and the extra
"movapd    %xmm10, %xmm0" added before the loop.

This all happens on SnB (Sandy Bridge); the code is generated with -O2 only.

H.J., any ideas?
