http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47167

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> 2011-01-05 20:09:11 UTC ---
(In reply to comment #3)
> > this could be the reason for slowdown.
> ....
>
> $ gcc -lm testcase2.s
> $ time ./a.out
>
> real    0m4.239s
> user    0m4.234s
> sys     0m0.001s
>
> The important change is the change of %xmm10 -> %xmm0 in the mulpd
> instruction.
> The functionality of the test didn't change due to existing "movapd %xmm0,
> %xmm10" at the top of the loop and added extra "movapd %xmm10, %xmm0" before
> the loop.
>
> This all happens on SnB, the code is generated with -O2 only.
>
> H.J., any ideas?

The performance of some loops is very sensitive to code size.  This change

-    mulpd %xmm10, %xmm2
+    mulpd %xmm0, %xmm2

changes the size of the loop: encoding %xmm10 requires an extra REX prefix
byte, which %xmm0 does not need.
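
To make the size difference concrete, here is a minimal sketch in Python that
encodes the register-to-register form of mulpd by hand.  It is an illustration
only and not something from the testcase: the helper name encode_mulpd is
hypothetical, and it assumes the standard SSE2 encoding (66 0F 59 /r) with a
REX prefix emitted only when one of the registers is %xmm8 or higher.

```python
def encode_mulpd(src: int, dst: int) -> bytes:
    """Encode `mulpd %xmm<src>, %xmm<dst>` (AT&T operand order, reg-reg only).

    Hypothetical helper for illustration; it covers just this one form.
    """
    if not (0 <= src <= 15 and 0 <= dst <= 15):
        raise ValueError("XMM register number must be in 0..15")

    # REX = 0100WRXB. REX.R extends ModRM.reg (destination),
    # REX.B extends ModRM.rm (source).
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)

    # ModRM: mod=11 (register direct), reg=dst low 3 bits, rm=src low 3 bits.
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)

    out = bytearray([0x66])        # mandatory prefix selecting packed double
    if rex != 0x40:                # REX is only needed for xmm8..xmm15
        out.append(rex)
    out += bytes([0x0F, 0x59, modrm])
    return bytes(out)


if __name__ == "__main__":
    for src in (10, 0):
        code = encode_mulpd(src, 2)
        print(f"mulpd %xmm{src}, %xmm2: {code.hex(' ')} ({len(code)} bytes)")
```

Under these assumptions the %xmm10 form is one byte longer than the %xmm0
form, which shifts every instruction after it in the loop by one byte.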