http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54000
Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|[4.6/4.7/4.8                |[4.6/4.7/4.8 Regression]
                   |Regression][IVOPTS]         |Performance breakdown for
                   |Performance breakdown for   |gcc-4.{6,7} vs. gcc-4.5
                   |gcc-4.{6,7} vs. gcc-4.5     |using std::vector in matrix
                   |using std::vector in matrix |vector multiplication
                   |vector multiplication       |(IVopts / inliner)

--- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-22 23:40:35 UTC ---
(In reply to comment #8)
> Thanks for the reduced testcase.  The innermost loops compare as follows:
>
> 4.5:
>
> .L7:
>         movsd   (%rbx,%rcx), %xmm0
>         addq    $8, %rcx
>         mulsd   0(%rbp,%rdx), %xmm0
>         addq    $8, %rdx
>         cmpq    $24, %rdx
>         addsd   %xmm0, %xmm1
>         movsd   %xmm1, (%rsi)
>         jne     .L7

4.8 r196182 with "--param early-inlining-insns=2" (2 x the default value):

.L13:
        movsd   (%rdx), %xmm0
        addq    $8, %rdx
        mulsd   (%rsi,%rax), %xmm0
        addq    $8, %rax
        cmpq    $24, %rax
        addsd   %xmm0, %xmm1
        movsd   %xmm1, 8(%rdi,%rcx)
        jne     .L13

>
> 4.7:
>
> .L13:
>         movq    64(%rsp), %rdi
>         movq    80(%rsp), %rdx
>         addq    %rcx, %rdi
>         addq    %r8, %rdx
>         movsd   -8(%rax,%rdi), %xmm0
>         mulsd   (%rsi,%rax), %xmm0
>         addq    $8, %rax
>         cmpq    $24, %rax
>         addsd   (%rdx), %xmm0
>         movsd   %xmm0, (%rdx)
>         jne     .L13

This is similar to what 4.8 r196182 produces without inliner tweaks:

.L18:
        movq    %rcx, %rdi
        addq    64(%rsp), %rdi
        movq    %r8, %rdx
        addq    80(%rsp), %rdx
        movsd   -8(%rax,%rdi), %xmm0
        mulsd   (%rsi,%rax), %xmm0
        addq    $8, %rax
        cmpq    $24, %rax
        addsd   (%rdx), %xmm0
        movsd   %xmm0, (%rdx)
        jne     .L18

> so we seem to have a register allocation / spilling issue here as well
> as a bad induction variable choice.  GCC 4.8 is not any better here.

All true, but in the end it looks like an inliner heuristics issue first (as
also suggested by comment #3).
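
For readers without the attachment at hand, the kind of source loop that produces inner loops like the ones quoted above can be sketched roughly as follows. This is a hypothetical stand-in, not the reduced testcase from this bug: the names Vec, Mat and matvec_accumulate are made up for illustration, and the 3-element shape is only inferred from the "addq $8" / "cmpq $24" pair, which suggests three 8-byte doubles per inner loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only; NOT the reduced testcase attached to the bug.
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Computes y[i] += A[i][j] * x[j] for all i, j.
// The sum is accumulated through y[i] on every inner iteration, so each
// iteration is a load, a multiply, an add and a store, matching the
// movsd / mulsd / addsd / movsd pattern in the listings.
// Each operator[] is a small member function that has to be inlined before
// the loop optimizers can see plain pointer arithmetic.
void matvec_accumulate(const Mat &A, const Vec &x, Vec &y)
{
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += A[i][j] * x[j];
}
```

Compiling such a file to assembly with g++ -O2 -S, once with default parameters and once with a changed "--param early-inlining-insns" value, is one way to make the kind of side-by-side comparison shown above. Whether this particular sketch reproduces the regression is not claimed here.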