------- Comment #58 from bonzini at gnu dot org 2009-05-06 09:56 -------
Uhm, it's better to run unpatched 4.5 with -O1 -fforward-propagate to get a fair comparison. Also, I was counting the loop headers, which are not part of the hot code.

                  4.2 -O1   4.5 -O1 -ffw-prop   4.5 + patch -O1
LOOP 1              181            201                180
INNER LOOP 1.1      117            118                113
LOOP 2               27             27                 26

This shows that you should compare running the code (you can use direct.i) with 4.2/-O1 and 4.5/-O1 -fforward-propagate. This is very important, otherwise you're comparing apples to oranges.

fwprop is creating too high register pressure by creating offsets like these in the loop header:

        leaq    -8(%r12), %rsi
        leaq    8(%r12), %r10
        leaq    -16(%r12), %r9
        leaq    -24(%r12), %rbx
        leaq    -32(%r12), %rbp
        leaq    -40(%r12), %rdi
        leaq    -48(%r12), %r11
        leaq    40(%r12), %rdx

Then, the additional register pressure is causing the bad scheduling we have in the fast assembly outputs:

        movq    (%rdx), %rax
        movsd   (%rax,%r15,2), %xmm7
        movq    (%rdi), %r15
        movsd   (%rax,%r15,2), %xmm10
        movq    (%rbp), %r15
        movsd   (%rax,%r15,2), %xmm5
        movq    (%rbx), %r15
        movsd   (%rax,%r15,2), %xmm6
        movq    (%r9), %r15
        movsd   (%rax,%r15,2), %xmm15
        movq    (%rsi), %r15
        movsd   (%rax,%r15,2), %xmm11

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
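
To make the register-pressure point concrete, here is a small Python sketch that tallies the destination registers in the leaq sequence quoted above. It is only an illustration and not part of GCC or of the patch: the helper name precomputed_offsets and the regular expression are made up for this example, and the input is just the eight leaq lines from this comment. On that input it finds eight distinct registers, each holding an address at a constant offset from %r12, which is half of the sixteen general-purpose registers x86-64 has.

```python
import re

# The leaq instructions quoted from the loop header in this comment.
LOOP_HEADER = """\
leaq -8(%r12), %rsi
leaq 8(%r12), %r10
leaq -16(%r12), %r9
leaq -24(%r12), %rbx
leaq -32(%r12), %rbp
leaq -40(%r12), %rdi
leaq -48(%r12), %r11
leaq 40(%r12), %rdx
"""

# Matches AT&T-syntax "leaq disp(%base), %dest" (illustrative, not exhaustive).
LEAQ = re.compile(r"leaq\s+(-?\d+)\(%(\w+)\),\s*%(\w+)")


def precomputed_offsets(asm):
    """Return {dest_register: (base_register, displacement)} for each leaq line."""
    result = {}
    for line in asm.splitlines():
        m = LEAQ.match(line.strip())
        if m:
            disp, base, dest = m.groups()
            result[dest] = (base, int(disp))
    return result


offsets = precomputed_offsets(LOOP_HEADER)
print(len(offsets), "registers hold precomputed offsets")
# List them ordered by displacement.
for dest, (base, disp) in sorted(offsets.items(), key=lambda kv: kv[1][1]):
    print(f"  %{dest} = %{base}{disp:+d}")
```

The same function could be pointed at the loop headers of the 4.2 and 4.5 assembly outputs to compare how many such offset registers each compiler sets up, assuming the headers have been extracted by hand first.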